R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

Course / Welcome / Basic Information About This MOOC

README

Pre-Course Survey

Course / Module 1: MapReduce / Outline of Module 1

README

http://www.mmds.org/ http://infolab.stanford.edu/~ullman/mmds/book0n.pdf

Course / Module 1: MapReduce / MapReduce

1. Distributed File Systems (15:50)

Welcome to Mining Massive Datasets. I’m Anand Rajaraman and today’s topic is Map-Reduce. In the last few years Map-Reduce has emerged as a leading paradigm for mining really massive data sets. But before we get into Map-Reduce proper, let’s spend a few minutes trying to understand why we need Map-Reduce in the first place. Let’s start with the basics. Now we’re all familiar with the basic computational model of CPU and memory, right? The algorithm runs on the CPU, and accesses data that’s in memory. Now we may need to bring the data in from disk into memory, but once the data is in memory, fits in there fully. So you don’t need to access disk again, and the algorithm just runs in the data that’s on memory. Now there’s a familiar model that we use to implement all kinds of algorithms, and machined learning, and statistics. And pretty much everything else. All right? Now, what happened to the data is so big, that it can’t all fit in memory at the same time. That’s where data mining comes in. And classical data mining algorithms. Look at the disk in addition to looking at CPU and memory. So the data’s on disk, you can only bring in a portion of the data into memory at a time. And you can process it in batches, and you know, write back results to disk. And this is the realm of classical data mining algorithms. But sometimes even this is not sufficient. Let’s look at an example. So think about Google, crawling and indexing the web, right? Let’s say, google has crawled 10 billion web pages. And let’s further say, that the average size of a web page is 20 KB. Now, these are representative numbers from real life. Now if you take ten billion webpages, each of 20 KB, you have, total data set size of 200 TB. Now, when you have 200 TB, let’s assume that they’re using the classical computational model, classical data mining model. And all this data is stored on a single disk, and we have read tend to be processed inside a CPU. Now the fundamental limitation here is the bandwidth, the data bandwidth between the disk and the CPU. The data has to be read from the disk into the CPU, and the disk read bandwidth for most modern SATA disk representative number. Is around 50MB a second. So, so we can read data at 50MB a second. How long does it take to read 200TB at 50MB a second? Can do some simple math, and the answer is 4 million seconds which is more than 46 days. Remember, this is an awfully long time, and is just the time to read the data into memory. To do something useful with the data, it’s going to take even longer. Right, so clearly this is unacceptable. You can’t take four to six days just to read the data. So you need a better solution. Now the obvious thing that you think of is that it can split the data into chunks. And you can have multiple disks and CPUs. you, you stripe the data across multiple disks. And you can read it, and, and process it in parallel in multiple CPUs. That will cut down, this time by a lot. For example, if you had a 1,000 disks and CPUs, in four thousa-, 4 million seconds. And we were completely in parallel, in 4 million seconds, you could do the job in, 4 million by 1,000, which is 4,000 seconds. And that’s just about an hour which is, which is very acceptable time. Right? So this is the fundamental idea behind the idea of cluster computing. Right? And this is, this tiered architecture that has emerged for cluster computing is something like this. You have the racks consisting of commodity Linux nodes. As you go with commodity Linux nodes because they are very cheap. And you can, you can buy thousands and thousands of them and, and rack them up. you, you have many of these racks. Each rack has 16 to 64 of these commodity Linux nodes and these nodes are connected by a switch. and, the, the, the switch in a rack is typically a gigabit switch. So there’s 1 Gbps bandwidth between any pair of nodes in rack. Of course 16 to 64 nodes is not sufficient. So you have multiple racks, and all the, the racks themselves are connected by backbone switches. And the backbones is, is a higher bandwidth switch can do two to ten gigabits between racks. Right? So so we have 16 to 64 nodes in a rack. And then you, you rack up multiple racks, and, and you get a data center. So this is the standard classical architecture that has emerged over the last few years. For you know, for storing and mining very large data sets. Now once you have this kind of cluster this doesn’t solve the problem completely. Because cluster computing comes with it’s own challenges. But before we get there, let’s get us, you know, ideal of the scale, right? In 2011 somebody estimated that Google had a million machines, million nodes like this. In stacked up you know, is, is somewhat like this. So, so it gives, so that gives you a sense of the scale of modern data centers and, and, and clusters, right? So here’s, here’s a picture. This is what, it looks like inside a data center. So the, the, what you see there is, is the back up racks, and you can see the connections, between, between the racks. Now, once you have such a big cluster, you actually have to do computations on the cluster. Right? And clustered computing comes with its own, challenges. The first and the most major challenge is that nodes can fail. Right? Now a single, node doesn’t fail that often. Right? If you, if you just connect, the next node and let it stay up, it can probably stay up for, three years without failing. Three years is about a 1,000 days. So that’s, you know, once in a 1,000 days failure isn’t such a big deal. But now imagine that you have a 1,000 servers in a cluster. And in your, and if you assume that these, servers fail, independent of each other. You’re going to get approximately one failure a day. Which is, still isn’t such a big deal. You can probably deal with it. But now imagine something on the scale of Google which has a million servers, in its cluster. So if you have a million servers, you’re going to get a 1,000 failures per day. Now a 1,000 failures per day is a lot and you need some kind of infrastructure to deal with that kind of failure rate. Your failures on that scale introduce two kinds of problems. The first problem is that if, you know, if nodes are going to fail and you’re going to store your data on these nodes. How do you keep the data and store persistently? What does this mean? Persistence means that once you store the data, you’re guaranteed you can read it again. But if the node in which you stored the data fails, then you can’t read the data. You might even lose the data. So how do you keep the data stored persistently if like, these nodes can fail. Now the second problem is is is one of availability. So, let’s say you’re running one of the computations, and this computation is, a, you know, analyzing massive amounts of data. And it’s chugging through the computation and it’s going, you know, run half way through the computation. And, you know, at this critical point, a couple of nodes fail, right? And that node had data that is necessary for the computation. Now how we deal with this problem. Now in the first place you may have to go back and restart the computation all over again. But if you restart it now and, and, and the computation turns again when the computation is running. So kind of need an infrastructure that can hide these kinds of node failures and let the computation go to go to completion even if nodes fail. The second challenge of cluster computing is that the network itself can become a bottleneck. Now remember, there is this 1 Gbps network bandwidth. That is available between individual nodes in a rack and a smaller bandwidth that’s available between individual racks. Though if you have 10 TB of data, and you have to move it across a 1 Gbps network connection, that takes approximately a day. You can do the math and figure that out. You know a complex computation might need to move a lot of data, and that can slow the computation down. So you need a framework that you know, doesn’t move data around so much while it’s doing computation. The third problem is that distributed programming can be really really hard. Even sophisticated programmers find it hard to write distributed programs correctly and avoid race conditions and various kinds of complications. So here’s a simple problem that hides most of the complexity of distributed programming. And, and makes it easy to write you know, algorithms that can mine very massive data sets. So we look at three problems that you know that we face when, when we’re dealing with cluster computing. And, Map-Reduce addresses all three of these challenges. Right? First of all, the first problem that we saw was that, was one of persistence and availability of nodes can fade. The Map-Reduce model addresses this problem by storing data redundantly on multiple nodes. The same data is stored on multiple nodes so that even if you lose one of those nodes, the data is still available on another node. The second problem that we saw was one of network bottlenecks. And this happens when you move around data a lot. What the Map-Reduce model does is it moves the computation close to the data. And avoids copying data around the network. And this minimizes the network bottle neck problem. And thirdly, the Map-Reduce model also provides a very simple programming model that hides the complexity of all the online magic. So let’s look at each of these pieces in turn. The first piece is the redundant storage infrastructure. Now redundant storage is provided by what’s called a distributed file system. Now distributed file system is a file system that stores data you know, across a cluster, but stores each piece of data multiple times. So, the distributed file system provides a global file namespace. It provides redundancy and availability. There are multiple implementations of distributed file systems. Google’s GFS is or Google File System, or GFS is one example. Hadoop’s HDFS is another example. And these are the two most popular distributed file systems out there. Our typical usage pattern that these distributed file systems are optimized for is huge files. That are in the 100s to, of GB to TB. But the, even though the files are really huge, the data is very rarely updated in place. Right, once, once data is written you know it’s, it’s very, very often. But when it’s updated, it’s updated through appends. It’s never updated in place. And for example let, let, imagine the Google scenario once again. When Google encounters a new webpage it, it adds the webpage to a depository. Doesn’t ever go and update the content of the webpage that it already has crawled, right? So a typical usage pattern consists of writing the data once, reading it multiple times and appending to it occasionally. Lets go into the hood of a distributed file system to see how it actually works. Data is kept in chunks that are spread across machines. So if you take any file, the file is divided into chunks, and these chunks are spread across multiple machines. So the machines themselves are called chunk servers in this context. So here’s, here’s an example. There are multiple multiple chunks servers. Chunk server 1, 2, 3, and 4. And here’s the file 1. And file 1 is divided into six chunks in this case, C0, C1, C2, C3, C4 and C5. And these chunks as you can see four of the chunks happen to be on Chunk server 1. One of them is on Chunks server 2 and, one of them is on Chunks server 3. Now this is not sufficient. You actually have to store multiple copies of each of these chunks and so we replicate these chunks so here copy, here is a copy of C1. On Chunk server 2, a copy of C2 in Chunk server 3, and so on. So each chunk, in this case is replicated twice. And if you notice carefully you’ll see that replicas of a chunk are never on the same chunk server. They’re always on different chunks of, so C1 has one replica on Chunk server 1 and one on Chunk server 2. C0 has one on Chunk server 1, and one on Chunk server N, and so on. And here is here is another file, D. D has two chunks, D0 and D1. And that’s replicated twice. And so and so that’s stored on different chunks server [INAUDIBLE]. Now so, so you serve you serve from chunk files and store them on, on these, on these chunk servers. Now we turn some of the chunk servers, also act as compute servers. And when, whenever your computation has to access data. That computation is actually scheduled on the chunk server that actually contains the data. This way you avoid moving data to where the computation needs to run, but instead you move the computation to where the data is. And that’s how you put a wide under the city data movement in the system. This isn’t clear when you look at look at some examples. So the sum of this, each file is split into contiguous chunks. And the chunks are typically 16 to 64 MB in in size. On each chunk is replicated, in our example we saw each chunk replicated twice. But it could be 2x or 3x replication. 3x is the most common. And we saw that the chunks were actually kept on different chunk servers. But, but when you replicate 3x, you know, the system usually makes an effort. To keep at least one replica in a entirely different rack if possible and why do we do that? We do that because it’s you know, the most common scenario is that a single node can fail. But it’s also possible that the switch on a rack can fail, and when the switch on a rack fails, the entire rack becomes inaccessible. And then if you have all the chunks for a, for in all the replicas of a chunk in one rack then that whole chunk can become inaccessible. So if you keep replicas of a chunk on different racks then even if a switch fails then it can still access that chunk. Right so the system tries to make sure that, that the replicas of a chunk are actually kept on different racks. The second component of a distributed file system is, is a master node. Now the master node is also known as the, it’s called a master node in the Google file system, it’s a called a Name Node in Hadoop’s HDFS. But the master node stores metadata about where the files are stored. And for example, if my you know, it’ll know that file one is divided into six chunks. And here is, here are the locations of each of the six chunks, and here are the locations of the replicas. And the master node itself may be replicated because otherwise it might become a single point of failure. The final component of a distributed file system is a client library. Now, when the, when a client, or, or an algorithm that needs to access the data tries to access a file it goes through the client library. The client library talks to the master and finds the chunk servers that actually store the chunks. And once that’s done the client is directly connected to the chunk servers. Where it can access the data without going through the master nodes. So the data access actually happens in peer-to-peer fashion without going through the master node

2. The MapReduce Computational Model (22:04)

Welcome back to Mining of Massive Datasets. We’re going to continue our lecture on MapReduce, and take a look on the MapReduce computational model. So before we look at the actual MapReduce Programming Model, let’s do a warm up task. Now imagine you have a huge text document you know maybe tera, terabytes long and you want to count, the number of times each distinct word appears in the file. For example, we want to find out that the word the appears 10 million times and the word you know apple appears 433 times. Right? And some sample applications of this kind of toy example in real life are you know if you have a big depth of a log and you want to find out, how often each URL is accessed that could be a sample application. Or it maybe building terms, such as text for a search engine. Right? So, but for now let’s just imagine that we have this one big file. That’s a huge text document and our task is to count, the number of times each distinct word appears in that file. So, let’s look at two cases, the first case is that the file itself is too large for memory. Because remember we said it’s a, it’s a, big, big file. But imagine that there are, is few enough words in it so that all the word count pairs actually fit in memory, right? How do you solve the problem in this case? Well it turns out, that in this case a very simple approach works. You can just build a, a Hash Table. I’ll, build the, the index by word. And and the Hash Table for each word will of course, will restore the count, of the number of times, that word appears. So you the first time you see a word you initialize you know, you add an entry to the Hash Table, with that word, and set the count to 1. And every subsequent time you see the word, you, you increment the count by one. And you, you make a single sweep through the file. And at the end of that, you have the word count pairs for every unique word that appears in the file. So this is a simple program, that all of us have written, you know, many many times in some context or the other. Now, let’s make it a little bit more complicated. Let’s let’s imagine that even the word, count pairs don’t fit in memory. Right, the file’s too big it doesn’t fit in memory, but, there’s so many words, distinct words in the file that even, you can’t even hold all the distinct words in memory. Right? Now how do you go about solving the problem in this case? Well you can try to write some kind of complicated code but you know, I’m lazy so I like to use Unix file system commands to do this. And so here’s how I would go about doing this So this is a a Unix command line way of, of doing this. You know here the the command you know words is, is, is a little script that goes through doc.txt which is the, which is the big text file. And it outputs the words in it one per line. And once once those words are output I can pipe them to to a sort. And the sort sorts the you know, sorts the output of that. And once you sort it all of the all occurrences of the same word come together. And once you do that you can pipe it to another little handy utility called uniq and one of the one of the nifty features of uniq is the, my, is the dash c option. And when you do uniq dash c what uniq dash c does is it takes a run of the occurrence of the same word. And then just counts the occurrences of the same word. So the output of this is going to be word count pairs, right? So and you know I, I’m sure many of you have done something like this. And if you’ve done something like this, you’ve actually done something that’s like MapReduce. Right? So this case actually captures the essence of MapReduce. And the nice thing about this kind of implementation is that it’s, it’s very, very naturally paralle, light, parallelizable as we’ll see in a, in a moment. So so let’s look at an old view of MapReduce using this example, right? So the the first step that we did. What we, we took the document which was our input. And we wrote a script called words, that output one word to a line, right? And this is what’s called a Map function in in, in, in MapReduce. The Map function scans the input file record-at-a-time. And for each record, it, it pulls out something that you care about. In this case, it was words. and, and the thing that the you output for each record you can, you, you cannot one or multiple things for each records. And the things that you output, are called keys, okay? The second step is is to group by key. And this is the what the, the sort step was doing. It grouped all the keys with the same value together. Right? And the the third step the is the, the unique minus c step. That’s the reduce piece of MapReduce. And once the reducer looks at all the key you know, all the keys with the same value, and then it, then it ru, runs some kind of function. In this case it counted the number of times the each key occurred but, but it could be something much more complicated. And once it does that kind of analysis it, it has an answer which, which it then writes up. Okay, so this is MapReduce in a, in a nutshell. Now the, the outline of this computation actually stays the stame, same for any MapReduce computation. What changes is it that it change the Map function, the Reduce function, to the fit the problem that you’re actually solving. Right? In this case for the word count the Map and the Reduce function were quite simple. In some other problems the Map and the Reduce functions might be more complicated. Here’s here’s, here’s another way of looking at it. You start with a, with a bunch of key value pairs. And so here’s k k v k stands for key and v stands for value. And the, the, the, the, the, the Map step. Takes the key-value pairs and maps them to intermediate key-value pairs. Okay? So for example, you run the Map on the first key-value pair pair here at k v and it it actually outputs two intermediate key-value pairs. And the, the intermediate key-value pairs need not have the same key, as input key value-pairs. They could be different keys. And there could be multiple of them. And the values although they look the same here, they, they both say v, the values could be different as well. And and notice in this case we started with the one input key-value pair, and the Map function produced multiple intermediate key-value pairs. So there can be zero, one, or multiple intermediate key-value pairs, for each, input key-value pair. Now let’s do it again, for the second key-value pair. Let’s apply the Map function and it turns out that in this case we have the one key-value pair in the, the in the intermediate key-value pair, in the output. And so on. So, so, we, we run through the entire input file. Apply the Map function to each input record. And create intermediate key-value pairs. Now the next step, is to take these intermediate key-value pairs, and group them by key. Right? So, all the intermediate key-value pairs that have the same key, are grouped together. So it turns out that there are three values. With the, with the first key, two values for the second key and so on. And they all get grouped together, and this is done by sorting by key and then by grouping together the value of, you know the values for the same key. And these are all different values, although I use the same same symbol v here. Now, once you have once you have these key value groups then the final step is the reducer. The reducer takes a look at a, a single a single key-value group as input and. It produces produces an output that has the sa, you know, that has the same key but it combines the the, the, the, the values or the values for a given key, into a single value. For example, it could add up all the values. In the, or, or it could or it could multiply them, or it could do, it could take the average. Or it can do something more complicated. But with all of the values for a given key. And finally you, the output, it outputs a single value for the key. Right? And so, when you, when you apply the reducer to the second key-value group. You get, you get another output and so on. And once you apply the reducer to all the intermediate key-value groups. You get the final output. So more formally the input to MapReduce is a set of key-value pairs. And the programmer has to specify two methods. The first method is a Map method. And the Map method takes an input key-value pair. And produces an int, an set of intermediate key-value pair, zero or more intermediate key-value pairs. and, there is one Map call, for every input key-value pairs. The Reduce function, takes an intermediate key-value group the intermediate key-value group consists of a key, and a set of values for that key. and, the output can consist of one, zero, one, or multiple key-value pairs once again. The key is the same as the as the input key but the value is, is, is is obtained by combining, the input values in some manner. For example, you might add up the you know, add up the input values and that could be the output v double prime here. So let’s look at the the word count example and run that through the MapReduce process again. Here’s our big document. And I hope you can see the text of this you know, the document but it doesn’t matter, you can see that there are words in there. And so we’re going to take this big document. And we’re going to take the Map function that’s provided by the programmer. The Map function reads the input, and produces a, produces a set of key-value pairs, and the key-value pairs in this case are going to be the key. Each word is going to be a key, and the value is going to be the number 1. Right? so, for example, the word the and 1 crew and 1 and so on, and the word the appears again. And so there, there’s another the, 1 here and so on. So these are the intermediate key-value pairs, that are produced by the Map function. [SOUND] Now the next step is the group by key step which collects together all pairs with the same key. So we can see that the, there are two tuples two intermediate tuples with the, with the key crew and then those are collected together here. There’s one with you know, with the word space, there are three with the word the, and so on. And they’re all sorted and collected together. In this yeah, in, in this place here. And the, the final step is the Reduce step. The Reduce, the Reduce step collects together all the values so the Reduce step adds, adds together the 2, 1 from crew. and, and figures out that there are two you know, two occurrences of the word crew. Space has 1. There are 3 tuples with the 1 for the there all added together. And the output is 3, and so on. Right, so this is a schematic, of the the MapReduce word counting example. now, of course this, this whole example doesn’t run on a single node. The data is actually distributed across multiple input nodes. So let’s take that into account. And see here’s here’s the data. The data’s actually divided here into, into multiple nodes. Let’s say the, the red the, the, the first portion of, of the file is it’s chunk one, and it’s on one node. The second portion of the file here is chunk two, which is on a different node. The third portion is chunk three, and the fourth portion is chunk four, and each of these is on a different node. Now the Map tasks are going to be run on each of these four different nodes. There going to be a Map task that’s run on chunk one that just looks at this portion, the first portion of the file. Map task is run on chunk two that, that, that just looks at the second portion of the file and so on. And the the outputs of those Map tasks will therefore be produced, on on four different nodes. li, like so so here are the, here are the first chunk of Map output. The second chunk of Map output, which is on another node. The third chunk of Map output, which is on a third node. And the fourth chunk of Map output, which is on yet another node. Right. Now the output of the of, of the Map functions, are therefore spread across multiple nodes. And what the system then does, is that it it, it copies the, the Map outputs, onto a single node. And then so you can see the data from all these four nodes flowing into this single node here. And once the data has, has flowed to the single node, it can then sort it by key and then do the final radial step. Now it’s a little bit trickier than this unfortunately. Because you know, you may not want to use you know, to, to move all the data from all the Map nodes, going to be a lot of it, into a single Reduce node, and sort it there. That might be a lot of you know, a lot of sorting. So in practice you use multiple Reduce nodes as well. And you, you know, when you run a MapReduce job, you can say you know, you can tell the system to use a certain number of Reduce nodes. Let’s say you tell the system in this case, to use three Reduce nodes. So if you use three reduce nodes then the then the MapReduce system is smart enough to split the the, the, the output of the Map into, into three, three into three Reduce nodes. And it makes sure, that for any given key in this case the, all instances of the, regardless of which Map node they start out from, always end up at the same Reduce node, right? So all instances of the, whether it started from Map node one or Map node two ended up at Reduce node two, in this case. And all instances of the word crew regardless of whether they started from Map node one or Map node four, ended up at Reduce node one. And this is done by using a hash function, right? So the system uses a hash function that hashes each Map key and determines a single Reduce node to shift that tuple two. And this ensures that all tuples with the same key, end up with the same Reduce node. And once once tuples end up at a Reduce node, they get sorted as before. in, on each Reduce node and and, and the result is created now, on multiple Reduce nodes. For example the result for crew is now on is now on Reduce node one. The result for the is now on, on the Reduce node two and the result for shuttle and recently are on Reduce node three. So the final result is actually now spread across three nodes in the system. Which is perfectly fine because you’re dealing with a distributed file system, which know, knows that your file is spread across three nodes of the system. So you can still access it as a single file in your client. And the system knows to access the data from those three three independent nodes. One final point before we move on from the slide is that all this magic in the MapReduce magic is implemented to use as far as possible, only sequential scans of disk as opposed to a random access is. If you think a little bit carefully, what all the steps that I mentioned about how the Map function is applied on the input file record by record. How the sorting is done and so on. A moment’s thought will make it apparent that you can actually implement, all of this by using only sequential reads of disk, and never using random accesses of disk. Now this is super important because sequential reads are much, much more efficient than random accesses to disk. If you don’t learn your basics of of database systems, it takes much, much longer to do random seeks. Than to do a single sequential axis of a file. And that’s why the, the MapReduce, the whole MapReduce system, is built around doing only sequential reads of files and never random accesses. So here is the actual pseudocode for for the word count using MapReduce. Remember, the programmer is required to provide two functions, a Map function, a Reduce function. And this is the the Map function right here. The Map function takes a key and a value and its output has to be int, an intermi, a set of intermediate key-value pairs. Now the key in this case is, is a document name and the value is the text of the document. And the Map the Map function itself is very simple in this case. It scans the the input document. And for each word in the input document [INAUDIBLE] the input document. For each word in the input document it emits that word and the number 1. So, so it’s, it’s a tuple, whose key is the, is the word. And whose value is the number 1. And here’s the reduced function. The reduced function, remember, takes a key and a set of values. The set of values all correspond to the same key and in this case, they just iterate through all the values and and, and sums them up. And the output has the same key and the value is the, is the sum. We looked at a very simple example of bullet count using MapReduce. Now let’s look at a couple more examples. Here’s here’s here’s another example. Suppose we have a large web corpus that we’ve called and, for each and we have a metadata file for a, for [INAUDIBLE] and each record in the metadata file lo, looks like this. It has a URL the size of the file, the date and then various other pieces of data. Now the problem, is for each host we want to find the total number of bytes. And not for each URL, but for each host. Remember, there can be multiple many URLs with the same host name, in the crawl and you want to find the number of bytes associated with each host, not with each URL, right? Clearly the the number of bytes associated with the host, is just the sum of the number of bytes associated with all the URLs for a, for the host and this is very easy to implement in in, in MapReduce. The mapper in this case, the Map function just looks at each record and it looks at the URL of, of the record and outputs the hostname of the URL. and, and, and the size, right? And the the Reduce function just sums the sizes for each host, right? And at the end of it, you will have this the, the, the size of each host. Here’s another example. Let’s say you’re building a language model by year. You have a large collection of documents. And you want to build a language model and and this language model for some reason requires the count of every 5-word sequence. Every unique 5-word sequence that occurs in a large corpus of document. Earlier we looked at accounting, each unique word. This example ask for each 5-word sequence. It turns out that the solution is not very different. the, just the Map function differs. The Map function extracts you know, goes through each document and outputs every 5-word sequence in the document. And the the Reduce function just combines those counts and adds them up and then you have the output. So I hope these simple examples illustrate how MapReduce works. In the next section, we are going to understand how the underlying system, actually implements some of the magic that makes MapReduce work.

3. Scheduling and Data Flow (12:43)

Welcome back to Mining of Massive Datasets. In, the previous section will be studied the Map-Reduce model and how to solve some simple problems using Map-Reduce. In this section, we’re going to go under the hood of a Map-Reduce system and understand how it actually works. Just to refresh your memory a Map-Reduce system has simple Map-Reduce system has three steps. In the Map step, you take a Big a document which is divided into chunks and you run a Map process on each chunk. And the map process go through each record in that chunk and it outputs an intermediate key value pairs for each vector in that, in that chunk. In the second set step which is a group by step you group by key. You you bring together all the values for, for the same key. And in the third step, is a reduce step. You apply a reducer to each intermediate key value pair set. And you create a final output. Now, here’s a schematic of how it actually works in in a distributed system. The previous schematic was how it worked in a, in a centralized system. In a distributed system, you actually, have multiple nodes and map and reduced tasks are running in pattern on multiple nodes. So the here are the few chunks of the file, input file might be on on node 1. Few chunks on node 2 and a few chunks on node 3. And you have map tasks running on, on each of those nodes. And and producing producing it to be intermediate key value pairs on each of those nodes. And once the, once the intermediate key value pairs are produced, the underlying system the Map-Reduce system uses a partitioning function which is just a hash function. So the the the Map-Reduce system applies a hash function to each intermediate key value. And the has function will tell the Map-Reduce system which, reduce node to send that key value pair to. Right, this ensures that all all the, the same key values, whether they are map task 1, 2, or 3 end up being sent to the same reduce task. Right? So, in this case the key key 4. Regardless of where it started from, whether at 1, 2, or 3. Always end up at reduce task 1. And the key, key 1 always ends up at reduce task 2. Now, once once the reduce task has a reduce task has received input from all from all the map tasks. All the map tasks have completed, then you can start the reduced tasks. And the, the reduced tasks first job is, is to sort, it’s input, and group it together by key. And so in this case, there are three values associated with the key key, key 4, they’re all grouped together. And once that is done, the reduce task then, works the reduce function which is provided by the programmer on each each such group and creates the final output. Okay. So remember, the programmer provides two functions, Map and Reduce, and specifies the input file. The Map-Reduce environment take, has to take care of a bunch of things. It takes care of Partitioning the input data. Scheduling the program’s execution on a set of machines. Figuring out where the map tasks run, where the reduce tasks run, and so on. It performs a gr, the intermediate group by step. And while all this is going on some nodes may fail. And the environment make sure that the node failures are hidden from the from the program. And finally the Map-Reduced Environment also Manages all the required inter-machine communication. [SOUND] So, we’re going to take, take a look at exactly how what’s, what’s going on in a. So, let’s look at the data flow that’s associated with with, with map reduce. Now the the input and the final output of a Map-Reduced program are stored on the distributed file system. And the scheduler tries to schedule the map task close to the physical storage location of the import data. What that means is that recall the input data is, is a file. And the file is divided into chunks. And there are replicas of the chunks on different chunk servers. The Map-Reduce system try to schedule each map task on a chunk server that holds a copy of the corresponding chunk. So, there’s no actual copy. A data copy associated with the map step of the Map-Reduce program. Now, the intermediate results are, are at least not stored in the distributed file system but stored in the local file system of the map and reduce workers. what, what are intermediate results? Intermediate results, intermediate results could be the output of a map step. An intermediate result could be something that, that limited why, why in the process of computing the reduce. Now why, why are such debated results not stored in the distributed file system? It turns out that there’s some overhead to storing data in the distributed file system. Remember there are multiple replicas of the data that need to be made. And so there’s a lot of copying. And network shuffling involved in, in storing new data in the distributed file system. So, whenever possible, intermediate results are actually stored in the local file system of the Map and Reduced workers, ended up being stored in the distributed file system to avoid more network traffic. And finally, as you’ll see in future examples the output of a Map-Reduce task is often being the input to another Map-Reduce task. Now, the master node takes care of all the coordination aspects of a Map-Reduce job. The master node keeps, you know, associates a task status with each task. A task to see the map tasker reduce task. And each task has has a status flag. And the status flag can either be idle, in progress, or completed. The master schedules idle tasks whenever workers become available. Whenever, there is a free a node that is tha, that’s available for, for scheduling tasks. The master goes through it’s queue of idle tasks, and schedules an idle task on that, on that worker. When the, when a map task completes, it sends the the master the location and sizes of it’s the R intermediate files that it, that creates. Now, why, R intermediate files? There’s one intermediate file that’s created for each reducer. Because the data, the output of the mapper has to be shipped to each of the reducers, depending on the, on the key value. And so there R intermediate files, one for each reducer. So, whenever, a map task completes, it let it’s, it’s, it’s stores the R intermediate files. On it’s local file system, and it let’s the master know what the names of those files are. The master pushes this inf, information to the reducers. Once the reducers know that all the mappers map tasks are completed, then they copy the intermediate file from each of the map tasks. And then they can proceed with their work. Now, the master also per, periodically pings the workers, to detect whether a worker has failed. And if a worker has failed, the master has to do something. And we’re going to, see what that something is. If a map worker fails, then the, all the map tasks that were scheduled. On that on that map worker may have failed. So, the the tricky thing is that the output of a map task is written to the local file system of the, of the map worker. So, if a map worker fails, then the node fails. Then all intermediate output created by all the map tasks that have ran on that worker, are lost. And so the, what the master does, is that it resets to idle, the status of every task that was either completed or in progress on that worker. Right, and so all those tasks need to be, eventually be done, and they will eventually be rescheduled on other workers in the course. If a reduced worker fails on the other hand, only the in progress tasks are set to idle. The tasks that are actually been completed by the reduced worker, don’t need to be set to idle. Because, the output of the reduced worker is a final output, and it’s written to the distribute file system. And not to the local file system of the reduced worker. Since, the output is written to the distributed file system. The output is not lost even if the reduce worker fails. So, only in-progress tasks need to be set to idle. While completed tasks don’t need to be redone. Right? And so, the and once again the Idle reduce tasks will be restarted on other workers eventually. What happens if the master fails? If the master node fails, then the map reduce tas, task is aborted. The client is notified, and the client can then do something like restarting the map reduce task. So, this is the one scenario where the task will have to be restarted from scratch. Because, the master is typically not applicated in the Map-Reduce system. So, you might think that, this is a big deal, that that the, the master failure means the the map-reduce task is aborted, and the task has to be restarted. But remember, node failures are actually, rather rare. A node fails actually recall once every three years, or once every 1,000 days. And the master is, is a single node, and therefore, the chance of the master failing is actually quite, you know, it, it, it’s quite an uncommon occurrence. the, the, the problem that you have with if, you have a multiple workers associated in, in a map reduce task. It’s much more likely that, one of many workers failed, rather than the master failing. So, the final question to think about is, how many map and how many reduced jobs do we need? [NOISE] Supposed you know, they’re both throughout M map tasks and R reduce tasks. Our goal is to determine M and R. The, this is part of the input that given to the map reduce system to let it know how many tasks tasks it needs to schedule. The Rule of thumb is to make M much larger than the number of nodes in the cluster. [][You might think, that it’s sufficient how one map task per node to the cluster.] [][But, in fact, it the rule of thumb is to have one map task per DFS chunk.] [][The reason for this is simple.] Imagine, that there is one map task per node in the cluster and during you know during processing the node fails. If a node fails then that map task needs to be rescheduled. On another node in, in the cluster when it becomes available. Now in, some, since all the other nodes are processing, you know, one of the map tasks has to, one of those nodes has to complete before this map task can be scheduled on that node and so, the entire computation is slowed down. By the time it takes to com, you know, complete this map task. The failed redo the failed map task. [][So, if instead of one map task on a given node, there are many small map tasks on] [][a given node, and that node fails, then those map tasks can then be spread across] [][all the available nodes and so the entire task will complete much faster.] On the other hand, the number produces R is usually smaller than M and is usually even smaller than the total number of nodes in the system. And this because the the output file is, is spread across spread across R node where R the number of reducers. And if it’s usually convenient to have the output spread across a small number of nodes rather than across a large number of nodes. And so usually R is set to a smaller value than M.

4. Combiners and Partition Functions (12:17) [Advanced]

Welcome back to Mining of Massive Datasets. In the previous lectures we studied the Basic Map-Reduce model, and then we looked at how it’s actually implemented. In this lecture, we’re going to look at a tuples of refinements to the basic Map-Reduce model, that can make it run a bit faster. [SOUND] The first refinement we’re going to look at is combiners. Now, one of the things that you may have noticed in the previous examples, was that the map task will produce many pairs of key value pairs with the same key. For example popular word, like the, will occur in millions and millions of key value pairs. Now, remember that the map tasks are actually happening in parallel on multiple worker nodes. And the, the key value pairs from each map node have to be shipped to to, to reducer nodes. If you sort of imagine a word like the, the on, on node 1, map task 1, it’s probably going to see a few thousand occurrences of the word the. And map task 2 is going to see a few thousand occurrences of the word the, and so on. So the output of map task 1, will have let’s say 1000 tuples, with the key the and value of one. Now all these, tuples will have to be shipped over to let’s say to the new task 1. . [][Now, by shipping a thousand tuples over all of whose you know, keys are the,] [][all of whose values are one it’s a lot of network overhead.] And, you can save some of this network overhead by doing an intermediate sum in the, in the map worker. For example, if we’re sending thousand tuples that each say, that, that each have the key the and the value of one. You can send a single tuple that has the key the and the value of 1000, right? And so, you can save a lot of network bandwidth by doing a little bit of pre-aggregation in the map worker. Here’s a mapper and the mapper is this is about code example again. The mapper the we, has b occurring once, c occurring once, d occurring once, e occurring once. D occurring once and b occurring, once again. And now the, the tuple b occurs two times here having in the output of this mapper. So, l, the combiner which is another function that is provided by the programmer. Combines the two occurrences of B, and produce a single tuple, B comma two which is then shipped, shipped over to the reducer. Since, we have two tuples of the form B1 being shipped over to the reducer, a single tuple of the form B2 gets shipped over to the reducer. And this way much less data needs to be copied and, and shuffled. So the, the combiner is actually also supplied by the programmer. The programmer provides a function combine. The input to the combiner is is, is a key and a list of values. And the output is a single value. So, instead of a whole bunch of tuples with the key k being shipped off to a reducer. Just a single tuple with key k and v2 is shipped off, to the reducer now usually the combiner is the same function as the reducer. So, if for example, if a reducer adds up its input values the combiner does the same thing as well. Further, we have to be careful, because this trick of using the combiner works only if the reduce function is commutative and associative. Let’s look at a couple of examples to see what what I’m saying here. So for example, let’s say the, the reduce function is a sum function. You want to add up all the input values, as in the count example. Now the, the sum function actually is commutative and associative; by which we mean, that, a plus b. Is b is the same as b plus a. And a plus b plus c is the same as a plus b plus c. This is the first property is the commutative property. And the second property the associative property. And because sum satisfies both these properties sum can be used as a combiner as well as a reducer. What that really means, is that if you have a lot of values that need to be summed. All these values need to be added up. We can break it up into two pieces. You you can sum up the first piece. You can sum up the second piece. And then you can sum up the, the two intermediate results and you’ll get the, you’ll get the proper, you’ll get the same, same answer. Okay, right, so, and so this is the first combiner sums up the, the, the, the first, set of output with the second, combiner sums up the second set of outputs. Then you sum up the two intermediate values, and you get the same result as if you had summed up all the original values to begin with. So, that trick works, because the sum is commutative and associative. However, there are some functions that are not commutative and associative, an example. Might be average, right? Let’s say the reducer needs to compute the average of the, of its setup input value. So this is, so the setup input values consist of a, a key, followed by a bunch of values. And the combiner the reducer needs to find the average of this set of values. Now, lets say, we divide this set of values into two sets. Compute the average of the set. Compute the average of this set, let’s say that’s average 2. And now we take the average of average 1 and average 2. That’s the average of average 1 and average 2. Now, it turns out that this is actually not. The same as the average of all the values that are out there. So, the average function that we’ve seen is not commutative and associative. And so, you can’t use it as a combiner. But it’s turn out you can still use the combiner trick if you are a little bit careful instead of using average as your reduced function. If the reduce function instead outputs you know, outputs a pair, which consists of sum and count, okay? Then the average can be computed in 1x plus trap, it’s just the sum divided by the count. So if, if the combiner ends up sending the average of all its values. Let’s say the key [SOUND] and values and here are the chunks. Now the combiner, the first combiner, compute the sum of this piece, and the count of this piece. The second combiner, compute the sum of this piece, and the count of this piece. And the third combiner, compute the sum of this piece and the count of this piece. And the, finally all these all these values, the sums of the counts get shipped to the reducer. And the reducer computes the final sum. Which is sum of the, the, the intermediate sums it has received. [SOUND] The final count, which is the sum of all the counts that it has received. [SOUND] And divides the sum by the count, the final sum by the final count. That, in fact, turns out to be the correct average. So, using this using this trick of using sums and counts it’s sometimes possible to turn a function that’s not commutative or associative. Break it down into functions that are communicative or associative like sum and count and still use a combiner trick to save some foot traffic. Unfortunately it turns out that while while most functions are amenable to the combiner trick. There are some functions that don’t work with the combiner trick at all. One example is is median. Right? The median of a set of values is obtained by sorting you know, [INAUDIBLE] sorting that set of values. And then finding the middle, the middle value in that, in that sought it list. It turns out and it can be prov minuen mathematically that there is no way to split the median competition. Into a bunch of commutative and associative computations. So you can’t actually use the combiner trick if your goal is to come through the median of a set of values. You just have to ship all the values to the reducer and compute the median at the reducer. The next refinement we are going to look at is the partition function. Now, remember that the map reduced infrastructure uses a hash function on each key in the intermediate key value set, and this hash function decides which reduced node that key gets shipped to. The map reduce system uses a default partition function which consists of hashing the key using a pre-defined hash function. And then taking the result modular R. Now, this gives a number from zero to R minus 1 which decides which reducer the key is sent to. Sometimes you may want to override this partition function with a custom partition function. For example. For example, you might want to ensure that all the URLs from a given host that say end up in the same output file. And are therefore sent to the same reducer. So instead, of hashing by key, you might want to hash by the host name of the URL and the map reduce framework allows you to specify the custom partition function that can do things like this. The initial implementation of MapReduce was done at Google. And Google first implemented a file system called the Google File System which is a distributed file system that provides table storage on top of its cluster. And then implemented the MapReduce framework on top of the Google File System. Google’s implementation MapReduce is not available outside of Google. Hadoop is an open-source project that’s a reimplementation of Google’s MapReduce. It uses a file system called HDFS for stable storage. And it’s implemented in Java. Hadoop is an Apache project. And you can freely download it from the Apache website. It turns out that many use cases of Hadoop involve doing SQL-like manipulations on data. And so there are open-source implementations called Hive and Pig that provide SQL-like abstractions of top of the Hadoop and MapReduce layer, so that you don’t have to rewrite those as map and deduce functions. That finally wrap up by looking at Map Reduce in the Cloud. Amazon’s Elastic Computer Cloud, for example, is one example of a service where you can rent computing by the hour. And Amazon also has an implementation of Map Reduce called Elastic Map Reduce that you can run in the Cloud. This concludes our discussion of Map Reduce.

[][Reading notes] [][2.2.7 Exercises for Section 2.2] Exercise 2.2.1 : Suppose we execute the word-count MapReduce program de- scribed in this section on a large repository such as a copy of the Web. We shall use 100 Map tasks and some number of Reduce tasks. (a) Suppose we do not use a combiner at the Map tasks. Do you expect there to be significant skew in the times taken by the various reducers to process their value list? Why or why not? Yes, the duplicate is each batch can be significantly diff (b) If we combine the reducers into a small number of Reduce tasks, say 10 tasks, at random, do you expect the skew to be significant? What if we instead combine the reducers into 10,000 Reduce tasks? 10 is good but 10,000 is bad, in English there are 10,000 frequently used words the word “the” will appear over ten million times with ten billion 1s ! (c) Suppose we do use a combiner at the 100 Map tasks. Do you expect skew to be significant? Why or why not? the combiner will remove the duplicate in each task, thus only lifted is the duplicate between each task, so there will be very less skew

[][Supplemental materials]

[][Dear Community,]

I have a Mapreduce job which processes 1.8TB data set. My map task generates around 2.5 TB of intermediate data and the number of distinct keys would easily cross a billion . I have set a split size to be 128MB. So, total number of splits generated is approximately 14,000/-. I have set a number of reducers to be 166. My cluster size is 8 nodes. 7 nodes are data nodes out of 8 nodes. 1 is a name node. Each data node has got 24 logical cores and 128GB RAM. When the job is running with this configuration, map completes its execution but my reduce phase stucks at 26%. May i know that what should be the split size and number of reducers i should have for this particular problem with my current cluster size. Please provide suggestions. Thanks.

[][Lets start with basics and try to answer your questions.]

  1. Split size = Hdfs block size by default . changing the split size will have a impact on the number of mappers and not reducers. 128 MB split size is good to start with.

  2. Rule of thumb : A reducer should process 1 GB of data ideally going by this logic you should have : 2.5TB / 1 GB = 2500 Reducers ,

  3. you have 20 * 7 = 140 containers(available in one go ) to run reducer , running 2500 reducers will take 2500 / 140 = 17 rounds which is a lot . Hence I will fix My reducer to some where aound 800 to 900.

  4. Your mappers is producing more data as intermediate , what are you doing in this step , can you use combiner(is it possible) to make this intermediate data small ? can you move some filter operation at map stage .

  5. If your reducer are getting stuck at 26% there can be several reason

  6. You have a skewed key which results in one reducer getting stuck

  7. 26% means its is stuck at shuffle phase itself , which is stating one reducer is getting a lot of data(another indication of skewed joins)

  8. Have you enabled compression for map output ?

[][kgautam, Thanks for your reply.]

  1. Currently, I’m not using any combiner. My map phase output <key,value> pair is <string/text,string/text>. As my value is string/text in map phase output <key,value> pair, I think that It will be difficult to write the combiner. Usually,the function of the combiner is same as the reducer. Here, I’m not able to think of writing the combiner for this particular problem.

  2. Currently,we tried with this compression for map output “-D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec”. Is this configuration enough to compress map output? Do we have to modify or write some statements in our mapreduce code to use this compression?

  3. May i know that where do you get this rule of thumb “A reducer should process 1 GB of data” ?

  4. When i have 24 logical cores in one data node, Why you have mentioned 20 * 7? I think that it should be 24*7?

  5. How to handle skewed key? Can i handle it using partitioner? Do we have any other way?

Thanks.

[][1) Currently, I’m not using any combiner. My map phase output <key,value> pair is <string/text,string/text>. As my value is string/text in map phase output <key,value> pair, I think that It will be difficult to write the combiner. Usually,the function of the combiner is same as the reducer. Here, I’m not able to think of writing the combiner for this particular problem.]

Is it possible to write key , List , you can byte-serialize the value , or use thrift definition to have the values together in one structure.

You are saving on not emitting the same key again . Use this if your specific use case permits. Mapper generating more data as comped to input means , it is emitting more records as received , which means key duplication is happening (Generally speaking)

  1. Currently,we tried with this compression for map output “-D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.Lz4Codec”. Is this configuration enough to compress map output? Do we have to modify or write some statements in our mapreduce code to use this compression?

conf.set(“mapreduce.compress.map.output”, “true”) conf.set(“mapreduce.output.compression.type”, “BLOCK”); conf.set(“mapreduce.map.output.compression.codec”, “Use one enabled in your cluster LZ4/SNAPPY/GzipCodec”);

  1. May i know that where do you get this rule of thumb “A reducer should process 1 GB of data” ? This is mostly used in all framework like Pig , as well as generally Reducer heap is in order of 1.5 GB . This is a default number. Please tune according to your needs .

Time taken to process X amount of data = Time to spawn the process(Scheduling time ) + time to do IO from file + time to process the logic. 1 GB is the minimum size below which job spawning time is comparable to the time taken for processing.

  1. When i have 24 logical cores in one data node, Why you have mentioned 20 * 7? I think that it should be 24*7?

I left 4 *7 containers for other services and other jobs running in your cluster. (Always good to underestimate while doing calculations for performance )

  1. How to handle skewed key? Can i handle it using partitioner? Do we have any other way? There are Many ways , best is to have a look at your skewed key and come to logical conclusion
  1. Look at Pig Skewed Join implementation (add salt to your key and then reduce twice , divide and conquer )
  2. Take top N events for a given key (If logic permits).

Course / Module 2: PageRank / Outline of Module 2

README

6. PageRank: The Flow Formulation (9:16)

So, the first approach to computing importances of the web pages in a big web graph, is called PageRank. So what we will do now is we will come up with the first mathematical formulation of PageRank. We will first talk about it intuitively, then we’ll mathematically formalize it. We will then talk also about how to compute, actually compute the important scores. And we will see that our initial formulation is broken, so we will then later fix it. But for now, we will, we will just come up with the initial formulation and this is called the flow formulation for the PageRank item so here is the idea. The idea is we think of links on a web graph as votes, right? So the idea is that the page quote on noting a graph is as important as the number of links it gets. Of course, the first question is what kind of links are we talking about? Are we talking about in-links? Are we talking about outgoing links? For example, we will look at in-links. In-links because in-links are kind of harder to fake than out-links. It’s very easy to have a page with lots of out-links. It’s harder to have a page that lots of other pages on the web point to. So that’s the first question. The second thing is that, it’s not enough just to consider in-links, but we also have to consider where this link is coming from. So for example a link from a given web page maybe from stanford.edu is more important than a link from some other web page that only receives very few in-links. Right? So the idea is that not all in-links are equal. Kind of links coming from important pages are worth more. And here, here is basically the idea. The idea is that an importance of a page is, is the, in some sense, depends on the importances of other pages that point to it. So it kind of, we have this recursive definition where importance of your given page depends on the importances of the others. And this importance then kind of gets po, passed on further through the graph. So this is the first idea and just to give you how an intuition, how PageRank scores of a graph look like. Here in this slide I’m showing you a, a small, a small graph with a set of directed edges. And here in this graph, the size of the nodes is proportional to it’s PageRank scores and here I normalize the PageRank score so that they sum to 100. And what do we see is the following. We see for example that, that node B has a very high PageRank score. The reason why it has high PageRank score is because it has lots of, lots of other pages point to it. For example, node C also has relatively high PageRank score, even though it receives only a single incoming link. The reason why C has a high PageRank score is because this very important node B is pointing to it. And then for example, you can see that this very small dark red links nodes have a relatively small PageRank scores. For example, the PageRank score of node D is higher than the PageRank of node A because D points to A. For example, we see that the node E has kind of intermediate PageRank score. And E points to B and so on. Right? So the results that we get from PageRank are kind of intuitive and they correspond to our intuitive notion of how important is a node, is a node in a graph. So now, what we look at this how do we compute this kind of scores or importances of nodes in a graph? So the idea is the following. The idea is that we will come up with a Simple Recursive Formulation. [][Where we sh, where we sh, think of each link as a vote.] [][And we think of the importance of a given vote, to be proportional to] the importance of the source web page that is casting this vote or that is creating a link to the destination. So the way we think about this is the following. Let’s think we have a page j with an importance r sub j. And for this leg importance be some number. And think that n, that the page j has n outgoing links. Right? So then, the way we will do, say is that now this importance rj of page j. Basically, gets split on all of its outgoing links evenly. So, the each link gets r sub j divided by n votes or amount of importance to spread to the target. So, so now this is how the importance gets from j to the pages it points to. And in a similar way, we can define the, the importance of j as the sum of the votes that it receives on its in-links. So, if we look at the simple graph here at the bottom. The idea is that the importance r sub j of web page j is simply the importance of page i the green page. Divided by three because page i has three out-links plus importance of page k divided by 4. Why by four? Because page k has four out-links. One, two, three and four. So this is how we compute the pa, the score of of page j. And now that we have the score of page j, the score further gets propagated outside of j along the three outgoing links. So each of these links, gets the importance of, of node j divided by 3. And this is basically all that is to this formulation. As every node collect importances of the pages that points to it. [INAUDIBLE] has its own importance and then kind of propagate it through, through their neighbors. [][So the idea is that basically this vote flow through the, through the network.] [][So that’s why this is called the flow formulation of a flow model of PageRank.] And just to give you an idea how this would work out. Here is a very small web graph, you know, from prehistoric times when the web only contained three websites a, m and y. And imagine that this is this is the structure, right? So y has a self link and then points to a. A points points to m. M points backwards and so on. And our initial idea as said before is the vote is, is from an important page is worth more. So, a page is more important if it’s pointed to by other important pages. So as I, as I kind of, I hinted on the previous slide, we will assign an important score. R to a page j and we will call this important score rank. So this is where the PageRank terminology comes from. So we, we call this importance to be rank. And now, the, our formula that we have for computing PageRank is very simple. We simply do say that the importance score of page j is simply the sum of all the other pages, i that point to it. Importance of that page i divided by the out-degree of the page. Right? So now, what they can do is basically dismiss it for every node in the network, we obtain a separate equation. So, for example, the importance of node y in my network is simply the importance of y divided by 2. Plus importance of a divided by 2. So why is that? Because y has two outgoing links. One link points, points to itself so it’s y divided by 2. And then similarly node a has two outgoing links. So, we take our a and divide it by 2. For example, if you say, what is the importance of node m? The importance of node m is importance of node a divided by 2. Again, why divided by 2? Because node a has two outgoing links and half of the, half of its importance goes to a. And half of its importance goes to, goes to y, as we saw here. Right? So now, it almost seems like we are done. Right? We have this set of equations that we would like to solve, right? We have three equations, three unknowns, no constraints and we want to solve this. The problem is that this has no unique solution. The reason is that the, the system is under constrained. So basically, all solutions will, will be, we can find an infinite set of solutions to these set of equations. And all, what these solutions will have in common? They will be equivalent up to the scaling factor. So we need, we need an additional constraint. So the constraint we will add to our system is to say that our paging scores half to sum to 1. So r, ry, ra plus rm has to be equal to 1. So now, we list additional equation. We can basically have now, three unknowns for equations. We can go solve this. For the small graphic, we can go solve this by hand. And we would come up with the, with the, with the solution which are our initial page, PageRank scores. Right. So for example y has score of 5 over 2. A has score of 5 over 2 and then m has score of one-fifth. Right? So, it seems like as, that we are done. [][Right? So we could use any kind of linear system] [][equation solving method.] [][For example, Gaussian elimination.] And be able to compute the importances of nodes in the graph. You know, this approach would work well for very small graphs, but it won’t, it wouldn’t, it won’t work for a, for a size of the web graph. So basically, for a graph where we have a billion web pages because it would mean that we have a billion of equations. A system of billion of equations that we would want to solve. So we need a different formulation.

7. PageRank: The Matrix Formulation (8:02)

So, in order to come up with a different mathematical formulation of this basic idea of modelling the flow, we will first define what we will call our stochastic adjacency matrix M. And then we will express everything in terms of linear algebra. In terms of basically the ranks and this matrix M as vector matrix multiplications. And we will later why this is good because we will be able to start using the linear algebra tools to implicitly solve the system of equations that I showed on the previous slide. So here is how we proceed. Our goal is to define the stochastic matrix M, that will basically be an adjacency matrix. Which basically means, we want to take the graph and represent it as a big matrix of values. And the idea is that if a page i points to page j then we will have a non-zero entry in the cell ji, and if there is, if the page, page j does not point point to page i then we, we will have a 0 entry there. So now the question is what is the value of the non-zero entry in, in the cell of the matrix. So the idea is if i points to j then the corresponding entry ji in the matrix M will be o over the f the out degree of the, of the source node. So out degree of node i. Here you will already seen what kind of where we are going right? Before we said that whatever is the importance of a node, this importance gets evenly split along all of its out-links. So this means that, that the, all the out-links of node i, will have the weight 1 over di. So this means that our matrix is called column stochastic. This means that every column, every column in our matrix, sums to 1. Okay, so now that we have the whole graph represented as a matrix you can also take all the page rank scores of nodes and represent that as a vector. Okay. So the way we do this is that we think that we have one entry in our vector per page, we can think that our pages are numbered 1, 2, 3, 4, 5 up to N so we have a vector length N. And every entry in this vector basically corresponds to the page rank score of a given, of a given page. Okay? So that is all good and the other thing we know from before is that the sum of the entries of our vector equals to 1. That was the constraint of the flow equations we had, we had on the previous slide. So now what is interesting is that we can take our flow equations, kind of our basic equation. And write it as a in terms of the matrix M and the vector r. So we can write it as rank vector r equals the matrix M times the vector r again. So now we basically have a big system of equations, right? M is fixed and we want to figure out what are the values of r. So just to convince you or demonstrate why, why, why we can take our initial fl, flow equations and express them into this vector matrix product. This may not be obvious, so here is how, how we can understand that what we are doing is actually true and correct. So the idea is the following, right? Imagine that I have my matrix M here at the bottom, and I have my vector r. And now I am multiplying M times r. And just for the sake of the example, let’s assume that page I has the, has the degree, out degree of i equals 3, and it also points, which means it points to three other pages, including j. This means that for a page i, the col, the i-th column of matrix M will have three non-zero elements. Here are indicated by squares, and each of these, each of these three non-zero cells will have a value of one-third right? 1 over the out degree of node i. So now imagine what happens when I take the j throw and multiply it with the vector r. When I’m, when I’m scanning across the row here and I’m scanning down the vector, the vector, the vector r. I’m basically computing the page rank score of node j, right? The page rank score of the node j is the sum of the importances that are stored in r, times the out degree of that node that points to j. So this way basically we take this initial equation that we, that we had before and we express it as a vector matrix product. So now we basically took our flow formulation of the problem and expressed it as this recursive, in a sense, matrix equation of r equals M times r. Now, what we observe is that this looks very much like an kind of Eigenvalue or problem. So let me just remind you what are Eigenvalues and eigenvectors of a given matrix. Right? So if I have a matrix A then x is called an eigenvector with the corresponding eigenvalue lambda. If x is a solution to the equation, A times x equals lambda x. Okay? So, just saying it again. A is a matrix that we, that we are given. X is something that we’d like to compute and it’s a vector. And lambda is also something that we’d like to compute, and is a scalar. Is a, is a real number, or a complex number. So, the point being is that x is an eigenvector, y is a eigenvalue, if they are solution to this equation Ax equals lambda x. So in, in our equation looks very much similar. Like, at least have M times r which is the same as kind of A times x. And then we say equals r, and before we had equals lambda x. So what this means, is that a rank vector is an eigenvector of the stochastic web matrix M. And another important fact is that it is a principal eigenvector. Which means that it corresponds to the eigenvalue with value 1. Right, so, here I can see that I implicitly multiply by 1 and my lambda is 1. Right? In fact the large, largest eigenvalue of M is once, is, is 1 exactly because M is column stochastic. Why is that the case? That’s the case because vector r has a unit length. Meaning its coordinate sum are non, are non, are non-negative and they sum to 1. And each column of M also sums to 1. So M times r will be the, the value of that product, of that dot product, will be at most at most 1. So this means that the that the corresponding eigenvalue the largest eigenvalue of our matrix is 1, okay. So why did, why did we do now. So far we took our graph represented it represented it as this big matrix, and we reformulated our flow equations into this matrix formulation and now we establish the connection between the matrix formulation and the eigenvalues and eigenvectors of matrix M. So, what, what this now means is basically instead of thinking of this as solving a system of equations, we can think of, of our problem as finding the eigenvector of, of matrix M. And actually there is a very efficient method for finding eigenvectors of a given matrix. And this method is called power iteration. So now we actually know how to compute page rank. The way to compute page rank is to find the, the eigenvector of matrix M that corresponds to the eigenvalue of value of 1. So that’s what we learn so far. So now the, we can actually go an compute the thing. So let me show you what we have so far. We have our little graph web graph on three nodes. We have our flow equations. And we also now can write what the structure of our matrix-m. Here is the structure or our matrix-m. Notice that the matrix is really column stochastic. So now, what we can, what, how we can think of our flow equations, we can think of them as this vector matrix product, right? So r, r equals M times r.

8. PageRank: Power Iteration (10:34)

And now the question is how do we compute, how do we compute the eigenvector to the, to the, or the solution to this problem r equal N times r. So the way we, we proceed is the, is the following. So the method is called the power iteration method and it assumes that on the input we are given a big web graph on N nodes. Where nodes are red pages and directed links correspond to hyper, hyperlinks. And then we think we, the power iteration is a very simple iterative scheme. The way the, the whole thing works is the following. We will start with our vector r, I, I, I have this subscript r of 0, which simply means that this is the, this is measuring the time, how the iterations proceed. So r our initial guess of our ranking vector r is simply that all the components of it are 1 over N. So, where N is the number of nodes. So naturally the compo, the comp, the entries of r sum to 1. So now all we do is we iterate our, our recursive equation. So we say that values of r at time t plus 1 is the matrix M, the, the stochastic adjacency matrix, times our previous vector r, r, r t. And we keep iterating this. And basically all we are doing is we are iterating this r equals M times r. And we keep iterating this until r stops changing. So this means we keep iterating this until this sum of the, let’s say, coordinate wise sum of the differences between the r of the current time step and r of the previous time step is less than epsilon. Right? So, and this is really, really all there is to the, to the page rank algorithm. We start with some guess of how our vector rank vector r is, then we multiply it with M usually around 50 or 100 times. And we keep monitoring how much does r change from one iteration to another, and when it stops changing, we stop. And, what we get is the page rank scores. So, of course if we have if we have this this algorithm, the question is how, how is this working? So let me just give you, give you an example using our old web graph idea. So, we have the three node web graph. We have, we have our matrix M here. We have the algorithm here on the left, a simple iteration as I mention before. And let me show how this would work. So, for example we start with r0 which is where the components of it are one-third, one-third, one-third. We multiply it with them and in the next time snap, so this will be r1, we would get the new vector. And then we could, we now compute r of, r, r of time 1 times M, we obtain r at time 2, and then again we would go multiply that again with M, would get r at time 3, and we would keep doing this. And at the end, r would actually converge to, to a vector that I will show you here where A and y nodes would have the importance of 6 over 15 and y would have the importance 3 over 15. Which is exactly the same values as we got before when we were actually trying to explicitly solve our system of flow equations, right? 6 over 15 is the same as 3 3 over 3 over 2 over 5 and 3 over 15 is 1 over 5. All right? So we got to the same solution as we had, as we got before when we were trying to solve a system of equation. But now we didn’t really kind of solve the system of equations explicitly, we simply did this vector matrix multiplication multiple times. And the thing converge, converged somehow mira, miraculously to the values we wanted. So, so far we looked at page rank in terms of a matrix formulation. So we, we express the set of flow equations as a vector matrix product, and then we saw that, instead of solving the flow equations, we can kind of find the, the eigenvector of a matrix M, in this way find the page rank scores. So what we will do next is we will look at an interpretation of what the page rank scores mean. And this is called a random walk interpretation. So basically we will see that page rank scores are equivalent to a probability distribution of our random walker in a graph. So, before I tell you the details, here is, here is a way how to think about this. We are thinking about the web graph as a giant graph and we are thinking about the, the the surfer. So a surfer is simply a person who is basically randomly surfing this graph. Which means that a, a surfer comes to a given web pages, web page, looks at all the outgoing links, peaks one at random and, and moves to the next web page. And the server is kind of browsing these graph indefinitely. So the idea is that at some given time t, surfer is at some node i, and what the surfer will do in the next time step at time t plus 1, basically the surfer will follow an out-link from i, and choose this out-link uniformly at random out of all the out-links at of node i. Okay? And then, now the surfer is at node j. So what time t plus 1 surfer is at node j, and again looks at all the outgoing links of node j and follows one of them at random. So now what we can also think about is, we can think of this vector p of t. And this p of t can, we can, we, we think of this as a probability distribution over the nodes of the graph, which basically tells us with what probability is a given, is a walker at time t at the given node. Okay. So we can see that we every node in a graph has a value associated with it. And this value corresponds to the probability that at a given time t, the, the random walker is at that given node. Okay. So now that we have defined the process and we have defined the notion of p of t. Now the next question is to ask, where is the random walker going to be at time t plus 1? Okay? So given, given the probability distribution where the random walker is at time t, that is called p of t, the question is, where is the random walker going to be at the next time step? And the answer to this is actually very, very intuitive. So, we can ask, what is the probability that the random walker will be at node j at time t plus 1? So if we want to compute this, then all that for node j what we have to look at is what are all the nodes that point to j. What is the probability that the random walker was at any of these nodes i, that point to j? And at every node i, the random walker basically has to go and take, take this link that points towards node j. So this means that whatever is the, was the, was the, was the probability that a given node that the random walker was at a given node. Now the random walker has to pick the out-link that points to node j. So which means that, that what we are basically getting is, is exactly our page rank equation if you, if you want to think about it this way. Right so, so the probability that the random walker is at the given node. Is simply the sum of the probabilities that the random walker in previous time step was at the neighbors that point to given node. And from every given node, the, the random walker transitions to the node j with probability of 1 over the, 1 over the out degree of that given node. Which is exactly the page rank the page rank formulation. Right? So this means that the probability distribution of where the random walker is either time t plus 1 is simply our matrix M times the probability distribution where the random walker was at time t. So now let’s suppose the following. Let’s suppose that the, that the random walk reaches what is called the steady state. Which means the probability distribution at time t is, equals the probability distribution of time t plus 1. This means that probability, p of t is a stationary distribution. So, probability solution at time t plus 1 equals M times p of t equals back p of t. Okay? So what we, what we observe now is that this stationary, stationary probability distribution p of t is, is exactly what was our, our original formulation of a of a random walk. What before we had r equals M times r. Now we have p of, p of t equals M times p of t. Right? So this means that our rank vector r is a stationary distribution for this random walk process, okay? So, this about this a bit. So basically what page rank score corresponds to? They correspond to the probability that this random surfer, that infinitely long kind of walks the, walks the web graph at a given, at a, at some given time t resides at the given node. So this is what is called the page rank, the random walk interpretation of page rank, where we can think of a score or a rank of a given node to be the probability that the random walker is at that given node at some, at some fixed time t. So, another important consequence of this random walk interpretation is that there is a rich literature on random walks. And random walks are really called Markov processes, or first order Mark, order Markov processes, because basically they have very little, very little history. And the central, the result from the Markov processes or random walk literature is that under certain conditions, basically conditions under matrix M, the stationary distribution is unique, and it will eventually be reached no matter what is the, the initial probability dis, distribution at the time t equal 0. So, what does this mean? This means that there are certain conditions on the structure of our graph. On the structure of our matrix M. And if our matrix M satisfies these assumptions, then the stationary distribution we, is unique. Which means there is only one unique rank vec, page rank vector r. And this unique page rank vector r will be, will be achieved regardless of how we initialize it. Which means that our pay power iteration will always converge to the same vector, regardless of how we initialize it.

9. PageRank: The Google Formulation (12:08)

So, so far we looked at the page length formulation, we looked at the linear algebra formulation, and we now looked at the random walk kind of intuition of PageRank. And we in the, in the last slide of previous lecture, we just said the that under certain conditions, these PageRank vector will be weak. So now the question is, what are these certain conditions that matrix M has to satisfy in order for the PageRank to exist and to be unique? And now what we will learn is basically we will learn the, the real, the Google formulation of the, of the PageRank algorithm. So, what we know so far is that the importance of a page j in, in, in a web graph is simply the sum of the importances of pages i that point to it, and when, when you sum these things together, we divide them by the out degree of, of the, of the source page. And we, what we established is that this equation that I have here is, simply can be written as a matrix equation. So now there are three questions we need to answer. First, does this M equal r equals M times r, does this converge? Second, does it converge to what we want? And third, are our results reasonable? So what we’ll do next is we’ll answer these questions one by one. So here is the first question, does this converge? Imagine a very simple graph, only two nodes. And node a points to node b, and b points back to a. And now imagine we want to run our power iteration. And as I, as I showed in the, in the last slide we said that the PageRank vector is unique and it, the stationary distribution will always be reached regardless of how we initialize our initial vector. So now imagine I, we initialize our, our vector r, r of, r of times 0 to be simply, to have, to have two values, to have one value 1 on the coordinate a and have value 0 on coordinate b. Now when we are multiplying M times r, what will happen is the score of a gets passed to b, and score of b gets passed to a. So the next time step the coordinates will flip. And now when we mul, multiply again, the coordinates will flip again. Right, so what we see here is that we will never converge. All, all that is happening here is that the score of one gets passed between a and b and score of 0 gets passed between b and a. So it seems that our, our PageRank computation will never converge. This problem is called the spider trap problem. I will explain it a bit more later. So, so far it seems that our method for computing PageRank the way defined it so far, doesn’t really work. So we looked at the spider trap problem that I’ll talk more about later. And here is another problem with the current formulation of PageRank, and this is called the Dead end problem. And here, the, the, the, thing is, even a simpler graph than what we had before breaks our algorithm. So lets consider a very simple graph. Two nodes, one edge, kind of, couldn’t be simpler. And let’s think of the same initialization vector as we had before. So a starts with score 1 and b starts with score 0. So what, what happens in this case is in the first multiplication with matrix m the score, the scores gets flipped. But now what happens in the second step of multiplication is basically the score of one gets lost, right? The, a, b is not able to pass the score to anyone else so the score get lost, and we converge to this vector of 0. Which is, which is a problem. So, what we will do now is actually talk about these two problems of spider traps and dead ends in, in more detail and develop solutions for them. So to summarize, there are two problems with how we defined PageRank so far. The first problem is the problem of dead-ends. So basically the idea is that dead-ends are these web pages that have no outgoing links. So what will happen is that the importance of these pages will link up, right? The idea is that basically whenever our web page receives its PageRank score and there is no way for our to pass this PageRank score to anyone else because it has no out-links, this PageRank score will leak out from our system. And at the end, the PageRank scores of all of the web pages will be 0 as we saw in the previous example. So that’s why this called the dead end problem. And then the spider trap problem, basically the idea is that here out-links from webpages can form a small group. So the idea is that the, the, if you think of a random walk interpretation of PageRank, basically the random walker would get trapped in a small part of the web graph, and then, the random walker will get kind of indefinitely stuck in that part. And at the end, those pages in, in that in that part of the graph will get very high weight, and every other page will get very low weight. So this is called the problem of spider traps. So what we will do next is we’ll actually develop our solutions to both of these problems. So let’s look at the two of the problems that we, that we just discussed in a bit more detail. So first is the problem of spider traps on a bit more complicated graph. So here is a variant of the graph we, we have from our initial investigation of PageRank with three nodes y, a and m, and in this case m is a spider trap which means m has this self loop. So whenever a random walker gets to know them, it basically gets stuck to this in this infinite loop because there is no other way out out of node m and all that m, the walker can do is infinitely walk the self loop from m. So now, think of the, the stochastic matrix m that we have here on the right. And the question is, what happens to the power iteration as, as, as, as we run it and multiply our r with the matrix m? So, here is the example. I have my vector r. I initialize it as we, as we said to 1 over the size of the graph. So, one third on every component and I start multiplying with matrix M. What will happen, is at the end, here is the result that we looked at. Basically the importance of node n will be 1 and the importance of both other nodes. Will be zero. If you think about the, the random walk interpretation of PageRank, this result is, is very much expected, right? If we think of a random walker browsing this string node graph, and we ask after lots and lots of time, where do with think the random walker will be? Basically the random walker will be stuck at, at, no-, at node n with probably 1. What this means is basically wherever the random walker starts, for some time then the random walker will be able to walk between nodes y, nodes y and a, but as soon as the walker crosses the edge to m it will be stuck in this infinite loop and it will never be able to move anywhere else. So this means that basically all the PageRank scores will be, will be concentrated at node m, which is what, exactly what happens. So, in some sense in this case, the PageRank’s, the page rank very nicely converged to some actor, but it converged to something that doesn’t make much sense, right? So all the, all the importance gets concentrated in, into this single node, and both nodes y and a have importance of 0. So, this is the problem of spider traps, is that they lead to results that are intuitive or not what we want. So now how to solve the problem of spider traps is to slightly modify our random walk way of thinking about PageRank, right? So the way Google solves the solution to the spider traps is to say that each step the random walker has two choices. With some para with some parameter beta, with some probability beta, the random walker will follow the, the outgoing link at random. So the same as the random walker was doing before, but with some remaining probability, the random walker will randomly jump to some other random web page. Right? So the way we can think of this now is that we have a random surfer, that whenever the random surfer arrives to a new page, flips a coin, with, and if the coin, this coin says yes, the random walker will pick another link at random and walk that link, and if the coin says no, the random walker will randomly teleport, basically jump to some other random page on the web, right? So this means that the tele, the, the random surfer will be able to jump out, or teleport out, from a spider trap within only a few time steps, all right? After a few time steps, the, the coin will say, yes, let’s teleport, and the random surfer will be able to jump up out of the, out of the trap. So, if we think about this in terms of the graph, here is our graph from before we node m being the spider trap. So what we can think of it, of, of this now, is that now we have this additional links that basically have with small probability, the random surfer can teleport out of, of any node a, any given time. So this means spider traps are no good problem anymore, so everything is good. The problem, we still have is the dead ends. So let’s understand the dead, dead ends problem a bit better. So, the problem with dead ends was the following. Was basically that these are the pages that have 0 out degrees. So, their PageRank score does not get distributed to any other paging the graph because they don’t point to anyone else. So going back to our 3, 3 node graph example in, in this case node n is the dead end because the out degree of n is 0. Okay? So now what we see if we look at the structure of our matrix m, the first thing we notice is that our matrix m is not stochastic anymore. Right? So our columns don’t sum to 1. In particular, the, the columns of, for node m do not sum to 1. The, the reason for that is, because node m has 0 out degree, so it will be all 0’s in that column of matrix m. So if we look at our set of equations what we used to have before was now is that basically the, the m does not r of m does not cor, does not appear in any of the equations. And now if we start, if we would go and run our Power Iteration. Here is, here is basically what happens. We again start with the vector of one-thirds. Keep, keep multiplying with m in the first iteration, second iteration, third iteration. And after a while all our, the, our vector basically converges towards 0’s. So basically it would say that all web pages in this graph have importance of 0. Which is, again, not what, not what we want. And basically the problem is that whatever is the PageRank score of, of node m, node m is not able to pass this PageRank score to any other node in the network so that PageRank score kind of leaks out of our system. So the question is, how do we solve, the, the problem we just observed with dead-end, with dead-ends? So, the, the way we solve the problem is to basically say the following. What we say is that if a node has no outgoing links, then when we reach that node, we will teleport with probability 1. So this basically means that for example whenever, whenever we reach node m we will always jump out of it random uniformly at random and tele, teleport somewhere else. So if you think, what do this to our stochastic matrix m and the column corresponding to node m. What happens is that basically now column 1 will have, will have values of 1 over 3 for all, all the, all its entries. What does this mean? It’s basically whenever a random surf, surfer comes to m, it teleports out, and with probability one-third, lands to any, any other node in the graph. So this is, again the way using the random jumps or random teleports, how we solve the problem of dead ends.

10. Why Teleports Solve the Problem (12:26)

So, now the question is why, why, why do, why do teleports solve all our problems, right? So before, here is the, the equation we had from before and in order to understand why the teleport solve all our problems, we have to go back to the theory of Markov chains that I alluded to in the previous lecture. So the way we define a Markov chain is the following. So Markov chain is this abstract mathematical object that has the following ingredients or parts to it. So first we think that we have set of states X, okay. Then we have a transition matrix P, where Pij simply measures what is the probability that if, if we were at state i, how likely are we to transition to state j in a given time stamp. Right so Pij means given that I was at j in previous step how likely am I transition to node to, to state i okay? And then pi is is a stationary probability distribution of being at any of the states x, and our goal right is to compute the value of this equation, that pi equals P times pi, right? So this would be a stationary probability distribution of this Markov chain that is defined over a set of states and with the transition matrix P. And again, you immed, immediately see the, the, the correspondence of our initial page link equation. Here we have pi equals P times pi, and we have r equals M times r. So, Theory of Markov chains says the following. It says that for any start vector the power iteration applied to this Markov transition matrix P will converge to a, to a unique positive, stationary vector. As long as this matrix P has three properties. It has to be stochastic, it has to be irreducible,and aperiodic. So now, what I will show is that for each of these three conditions, stochastic, irreducible, and aperiodic, actually adding random teleports gives us a, gives us a, in some sense stochastic transition matrix that has these properties. So what we will do now is, convince ourselves that our matrix M together with this random teleports, has all the three properties that we need for the for the PageRank r to exist. So the first question is, why do teleports make m stochastic? So a matrix is stochastic if its columns sum to 1. So in our case or in case of the dams when we have this node m that has no out-links, the column for node m did not sum to 1, it sums to and the specificity condition for the matrix was violated. So now if we add this random teleportation, we can that occurs with probability 1, we can basically think of this as adding, add, the green address from node m, to any other node in the network, including the [INAUDIBLE]. Right, this means that our matrix m now got transformed and our column for m, for node m now has these values of 1 over 3 in each so the column sums to one and we get the stochastic stochasicity property of matrix m. The way we can think about this in terms of equations is that basically we say our, we define a new matrix A where we take our previous matrix M, and now we, we introduce two pieces of notations here. First I have this vector A where the i-th component of vector A equals 1 if node I has out degree 0, if node I is a dead end, and otherwise if it has value 0. And then this vector E is just vector of all ones, so it’s a vector where every component has a value of 1. So what we basic, what this basically means is we take matrix M and wherever there is a column with in the matrix M that has all 0’s, we replace that with 1 over the out degree of, of that given node. Exactly what we, what we need in the case of m. So this is what we did now, is basically a random, a random teleportations. What they do, they take our matrix m that cannot be stochastic in the graph because it dead ends and transform into a new matrix A that is now stochastic. By, by taking the teleportation with probability 1 out of the nodes with 0 out degree. So that’s the first property. The second property is that m has to be a periodic. So we say that a chain is periodic if there exists some value k such that the interval between two visits to some state s is always a multiple of k. So for example, if we were to have a graph on three nodes with with a directed cycle, as I have it here, then for example, this a, this would, this would be a periodic, periodic chain, because the, the random lock here is deterministic and every, every two steps we return back to the, to the same node. So by adding teleports what this basically means is that that at any time we will be able to jump out of, of this kind of infinite, infinite loop. And we can even think that what we have is we have this self loop so that the, the random walker can, can get the, can spend some time at the given node and this way the periodicity is broken. And this is how basically random teleports solve the periodicity problem. Now the last property we need to talk about is irreducibly, and we say that m is irreducible when from, from any state there is a non-zero probability of going to any other state in the, in the network. This means that, basically, we can never get stuck in a given state. So the way, for example, we would make our given graph here irreducible is to add all these other, other possible links, which basically means we would add a random jumps. So this would mean that the, that there is a non-zero probability of going from any state to any other state in our graph. So putting all this together this is, this is exactly what random jumps do. So basically Google’s solution to, to PageRank and to random server interpretation of PageRank was to introduce random jumps. So the idea is that we state we want to take. Th, matrix M, make it, make it stochastic, aperiodic, in irreducible. All this is achieved by slightly modifying the, our random walking process, where at each step a random surfer has two options. With probability beta, the random surfer goes and follows a random outlink. And with probability 1 minus beta. The random surfer jumps to some other page at random. So now what this basically means is that this now changes our PageRank equation. So if you think about the PageRank equation now, it’s a bit different. So here for example, the score of node j can be computed as follows, right? So basically what this is saying is the following. The importance of node j is first. The sum of the importances of all the nodes i that point to it. Where r sub j, r sub i is the probability that random walker is at node i. Then we divide that by the outdegree of i as the probability that the random walker actually traverse the link towards j. And, this only happens with probability beta because the random walker, when they are at node I, has to decide to actually follow a link, and this happens with probability beta. And then of course, how likely is the random walker to visit node J? It either does it by by, by following a link. Or with probability one minus beta the random walker decides to jump. And if the random walker decides to jump then it will land at a given node J with probability one over N where N is the number of nodes in the network. Right? So basically we took our initial formulation of PageRank and now we change it a bit where we have the random walk part. This is the part where we kind of multiply with beta. And we have the random jump part. Where we have the 1 minus beta. So now the question is given this new random walk formulation. Is, is the power iteration still going to work right? Now we have a different more complicated recursive equation. So the question is how, how do we compute this? And the way we compute this is basically to, to run, to run our eigenvector finding method, our power iteration again. The way, the way we see that basically we have the same problem as before is to notice the following. So we have this, what we will call Google matrix, we will call it A. And we will express A as a matrix M plus some other matrices. So we take our matrix M and multiply it with beta. This is the, the part that comes due to random jumps. And then what we want to do is we have this other part, 1 minus beta, that basically this is the probabilities or transitions due to random jumps. And simply the expression you get here is that this is 1 minus beta, 1 over n, where n is the nodes in the graph, times the outer product of this vector of all, of all 1’s called e, okay? So what this means is that even with these random jumps the PageRank solution can be expressed exactly as we had it before. That r equals A, now this is the Google matrix not the matrix M anymore time, times r. Of course one question that we need to answer is what is a good value of beta right? How often should the random worker jump? For example is beta would be 0, then what would that mean is that the random walker jumps all the time, so all the nodes in the network have exactly the same prob the same PageRank score. Because the random walker is not really walking over the graph, it’s just randomly jumping all the time. If we set beta to be equal to 1, then basically there is no random jumps. And this means that our Matrix A wouldn’t be stochastic anymore, and, and so on. And we would have no random jumps and PageRank wouldn’t, wouldn’t really work. So what turns out is that the good value for beta is to set beta between 0.8 and 0.9. And usually people set beta to be 0.85. Which basically means for every five steps you do a random jump. So a random walker, in some sense, in, on the average would do five steps and jump, another five steps and a jump, and so on. So that’s basically the the idea. So, let’s now see how this PageRank formulation would work in, in the real world. So imagine we have our old graph as we had it before, three nodes. And in this case, node M is a, is a spider trap right? That is the self loop. What I also have on this graph is, I have these green edges. In these green edges, you can think of them as edges that are there due to, due to random jumps. So we have our matrix M, as we had it before. With matrix M everything is fine, it’s still, it’s still stochastic, the only problem is that node n is a, is a dead end. Oh, sorry, is a spider trap. And now what I also did in this graph is I labeled every, every edge with with it’s transition probability. So the way, the way we do now is we take this matrix M, we take this other matrix of one-thirds and multiply it with 1 minus beta, so in this case we are assuming beta is 0.8, and this gives us the mat, the matrix A. And now if we do the ma, multiply our r with matrix A, this is how the using power iteration, these are the different versions of vector r as we keep multiplying, and at the end the PageRank scores we would converge to are given here. So basically the score for node y would be 7 over 33. For node A would be 5 over 33. And for node M would be 20, 21 over 33. So what do we see? We see that m is still the most important node in the graph, because of this spider trap. But we see that nodes y and a have also non-zero score. And actually node a is more important than node y. So it seems everything works and everything is fine.

11. How we Really Compute PageRank (13:49)

So far we have been exploring PageRank and we have formulated it as this eigenvalue problem and we added the random walks and teleportation suite. So now the what we look next is actually how do we really compute PageRank for the web scale graphs? Basically how do we compute it for graphs that don’t even fit into the main memory of a machine? So let’s look at what we know so far. Right? What we know so far is that the key step in computing PageRank is this vector matrix multiplication where we are computing the new vector r by taking the the matrix M and multiplying it with the old vector r. right. This is very easy to do if we have enough main memory in the machine to, to store both a, the old version of r, and the new version of r. Now, let’s look at that is the structure of matrix A. And we will notice one thing that, that is kind of striking. Right, so here I have an example from our previous, previous slide with the three, three node graph where our matrix A equals the, the matrix M times beta plus 1 minus beta times this matrix of 1 over N’s, where N is the num, the size of the, size of the graph, and the size of this matrix is N by N. Right, so notice down here what happens. Right, we take our matrix M, multiply it with 0.8. We take this matrix of N by N where every entry is 1 over N. Multiply that with 0.2, and sum these to matrices these together. So what happens now it that our matrix A is this big, big matrix with non-negative entries. And what we notice is that every entry in this matrix is non-zero. So, initially our matrix N had a number of 0’s, but not our matrix A has no non-zeros. So what is, what is important now is that the amount of memory we will need to store A will be actually huge, right? A is now what is called a dense matrix. Let me just give you, give you an example. So imagine we have graph on 1 billion nodes. So basically we have 1 billion web pages. And 1 billion web pages is a very tiny fraction of the web graph. So we use a small graph from the web size point of view. And lets assume that we need, lets say, 4, 4 bytes to save, to save each entity of this graph. Right? So to save a node ID. So, this means that for example to store 1 billion pages and and address. This means that we will need around 2 billion 2 billion entries. And this, this means that we will need, need around 8 GB of memory to store matrix M, for example. But for example, if you would have to, want to store all the entries of matrix A now the size of matrix A is N by N, which is N squared. Which means that we would have to store N squared entries. And the amount of memory that we would need would be like, 10 to 18, which is a huge number and we would never be able to store this in memory. I, I think there is not even. If you take all the computers in the world, that is not, they don’t have enough memory to store 10 to the 18 bytes or integers in memory. So the question is what, what can we do? Is there a way to get around this, and kind of preserve the, the sparsity of the matrix M, right? Basically the idea is that in a real matrices M will be extremely sparse. Meaning that on, on, on average page only points to ten other page in the graph, right. So this means that rather than storing the whole, the whole, the whole vector of 0’s, and then only a few non-zero elements, we kind of only want to store non-zero elements. So here is, here is basically the idea and how we, how we now can solve the, the PageRank problem. So just to remind you, we, we right, we have the capital n number of pages or number of nodes in the graph. We are now define, we defined matrix M to that’s in such a way that the entry Mij is 1 over the degree of j, if node j points node i, and otherwise that entry is 0. And now let’s think about, how do random teleports play with this, into this matrix and right, so basically the way we can think of teleport is basically, we take the our initial graph and, and all these additional transition edges. Right? So we can think that random teleport’s basically take our graph and transform it into a completely connected graph where we have two types of edges. We have the edges that are that corresponds to the hyperlinks. And then we have the edges that correspond to random teleportations. Right? Whenever we can teleport from one node to another we add an edge. And we say that kind of a random surfer traverses this edge with a very small probability. And this is exactly what we will do now, right? So basically adding the teleport link from node j to every over page in the graph, and we want to set this teleportation probability of such teleport link to be 1 minus beta. Right, that’s the probability of a teleport times 1 over n. Right? This is the number of, the number of web pages so that’s the, that’s the probability of traversing one individual teleportation link. Right? This also means that what we need to do is now that on the all the other links of the web graph we have to reduce their, their transition probabilities right? So the transition probabilities go from one over the out degree of node j to beta times 1 over the out degree of node J. Right? So now basically what, what, what we did, is we, we took our graph and we and we in some sense transformed it into this fully connected graph, with different position probabilities over the edges that correspond to the hyperlinks and edges that correspond to teleports. However, what we notice is that this transformation is actually equivalent to saying that we will text the PageRank of every page by a fraction of 1 minus beta. And then we will take this 1 minus beta fraction of the PageRank scores, and we distribute it the right? Meaning that this random teleportation part of the PageRank score, gets evenly distributed among the webpages. So this is the, this is the intuition. So let me now show you how the mathematics works out. We will start, again, by looking at our initial PageRank equation, r equals A times r, where the way we define M, just to remind you is to take every entry ij of matrix i, is simply beta times the, the corresponding entry of matrix M plus 1 minus beta over N, where N is the size of the graph, right? The way we think about this is basically we say the transition of a random walker from, from node j to node i is simply the beta times the transition due to the random walk, plus a small constant, constant factor that happens due to a random jump probability transition, right? So now that we have this, let’s unpack our PageRank equation. If we unpack this PageRank equation, basically what we are saying N3i of the PageRank vector r is simply a, a summation over the, over the, over the j when j arranged for 1 to N, Aij times rj. Right? What we will do now is we will take this A and expand it into the, into the equation we have above. So if we do this, what, what we notice is the following, right? This is simply a summation, here is, here I’m expanding the definition of A now into how we computed matrix A. And I have my entry from the rank vector r. So I can distribute the summation. And here is we, what, what happens, right? So we have now two summations both over the same range. And what we observe now is one thing that we know that vector r is is a probability distribution, which means that the entries vector r sum to 1. Which means that this second summation here, the entries the sum over j of rij sums to 1. So the second summation basically simplifies to just this constant of 1 minus beta over N. So notice what, what we did right now. What we did is basically we expressed our r equals A time r into a different sum, in, into a different expression. Where we say that our r equals beta times M times r plus some some, some constant. Right? What this means is now basically that we never really need to explicitly express matrix A, right? We never really need to materialize this big, dense matrix where every entry is non-zero. We can only work with matrix M because matrix M is full of 0 elements, which we don’t need to explicitly store. So we can actually work with much, much smaller matrix. So this is basically the, the, the good thing that happened here. So, rather than working with a big n squared size, size matrix, we are now working on a, with a very small matrix. And let me just demonstrate how, how much difference this makes in practice, right? So what we did in the, in the few last slides, is basically we re-arranged out PageRank equation into a, into a different equation that doesn’t depend on a, but only depends on our matrix M. Right? So now, what is important is that matrix M is sparse. Right, what do I mean by that? Is for example, in the matrix M, if you think of it as a, as a matrix, in every, in every row. Even though this matrix is let’s say a billion times a billion, like 10 to the nine times 10 to the 9 the individual, the number of non-zero entries in every row on average will be around 10. What this means is that an average page has around let’s say 10 out-links or even 100 out-links. But the point being is that out of 1 billion entries in every, in every row of this matrix, there will be only 100 non-zero’s. So this means that rather than storing the full and squared number of cells in this matrix, we will only store the non, the cells that actually have the non-zero value. So what, what this means is that in, in every iteration of our PageRank equation basically all we need to compute in some sense is the, is the product of our matrix, matrix M with the, the, with our old rank vector, and then add a constant value of to every rank, rank score, just because of the random of the random jumps. Of course, here I was assuming that our graph M has no dead ends. If our graph will have no will have dead ends, this means the vector r won’t sum to 1, so after everything is done, we will have to kind of expand it back to normalize to 1. I’ll talk about this a bit more in the next few slides. So, here is now the complete algorithm for PageRank, how one would go and actually implement it in practice. So, the way this works is on our input we are given a directed graph G, and we are given a parameter beta. This is the teleportation parameter beta. Right? And what we are assuming in the algorithm is that our graph has both spider traps and dead ends. Kind of, nothing will break the algorithms robust, regardless of whether this dead ends and spider traps actually happen. So, the output of our algorithm is now a page link vector r, and the way the algorithm operates is the following. At the beginning we set all the, all the PageRank scores to be 1 over N, to be equal. And we set time equals 1. And then we have this loop where we iterate until our PageRank vectors will converge, which means that the PageRank vector between ti, time step t minus 1 and time step t, the, the in the the individual entries do no, do not change much, but do not, do not change more than some, some small value epsilon. Which is, and this epsilon is again a user set parameter, something small but and depending on how small this value is, this is a number of iterations that are going, and we need to converge. Okay, so now we explained the outer loop. So now, now let’s look what happens in the inner loop. So in the inner loop, first we have, we have another a full loop where basically we go and update the PageRank score of every node by simply taking the PageRank scores of every node i that points to it and then divide the by r degree of phi. This is the first part. Of course here we have to be careful if node j has gets the rank 0, if the n degree of it is 0, right? So, if a node has no n degree then we set its PageRank score to 0. And then of course what happens now is that if we have dead ends, the, the PageRank will leak out. So now we need to figure out how much of the PageRank has leaked out, and then reinsert the PageRank the missing PageRank scores into our rank vector. So, all we, what we do here is the following. We compute what is the sum of the components of our PageRank vector are so far, we call this s. We want this s will be less than 1, so what means is that 1 minus s amount of PageRank has kind of flicked out. So we want to now take this and, and evenly insert it into every entry of r. So this exactly what we, what we do. We say now we go over vector r again. We say the, the true value of the PageRank score is whatever we had before plus this missing part of the PageRank score. Right? 1 minus S over N. Which means that every node N gets what, 1 minus S times 1 over N fraction of the, of the leaked out score, so that now again our vector R will sum to 1. So with this, basically we obtained the new, new version of r, we check for convergence, and we keep we keep repeating until until we converge.

Study material

Course / Module 2: PageRank / Homework: MapReduce and PageRank

Course / Module 3: Locality-Sensitive Hashing / Outline of Module 3

README

12. Finding Similar Sets (13:37)

We’re now going to take up a class of problems where we’re given a large collection of sets, millions or billions perhaps, and we are asked to find those sets that are similar. The notion of similarity is quite specific and it’s called Jaccard Similarity. We’ll learn this concept soon, but the idea is roughly, that the larger the fraction of elements the two sets have in common the more similar they are. There is a fundamental problem of scale. If we have even a million sets not a large number compared with the number of web pages or Amazon users, the number of pairs of sets is half a trillion. We don’t have the resources to compare them all, so we need some magic defor, focus us on the pairs that are likely to be highly similar, never looking at the vast majority of pairs. When you learned about hashing you’re, it probably seemed like a bit of magic. You have a large set of keys, and when you want to find some key k, you go right to it without having to look very far at all. The technique we’re going to learn, locality sensitive hashing, is another bit of magic. Here we are pointed right at the similar pairs without having to wade through the morass of all pairs. We’ll begin by looking at some applications where finding similar sets is very useful. We then are going to focus initially on finding similar documents, meaning that they have a substantial amount of text in common. For this problem we first study shingling, which is a way to convert the informal notion of similar documents into a formal test for similarity of sets. Then we learn the remarkable technique called min hashing, which allows us to replace a large set by a much smaller list of values. The magic of min hashing is that the similarity of the small lists, called signatures, predicts the similarity of the whole sets. Finally, we take up the locality sensitive hashing technique itself and see how to find similar sets or similar documents without doing anything that involves searching all pairs. To begin, let’s look at some of the interesting data mining problems that fit the pattern of mining for similar sets. For example, we can view web pages as the set of words they contain. If two pages have similar sets of words, we might expect them to be about the same topic. For another example, imagine a matrix of Netflix users where the rows correspond to the users, and the columns to the movies. The entry for a given user and movie is the rating that the user has given the movie, blank if no rating. We might see a user as the set of movies they have rated four or five, that is movies they like. Two users who have similar sets of liked movies probably have the same tastes, and Netflix can use the movies one user said they liked to recommend movies to the other. We can use the same idea backwards. Where we think of a movie as the set of users who like that movie. Movies with similar sets of users can be expected to belong to the same genre of movie. People create records of data about themselves at many different sites, Google, Amazon, Facebook and so on. We may want to figure out when two records refer to the same individual, and this need gives rise to the problem called entity resolution, determining the set of records that refer to the same individual. To see the problem, many sites will ask for phone number. But you might give your land line at one site, your cell phone number at another, not give a number at all at a third site, and mistype your number at a fourth. However, we can often wade through the errors and ambiguities by thinking of a record as a set of attribute value pairs. Pairs like, oh, phone is 555, I don’t know, whatever okay? Records with similar even if not identical sets of attribute value pairs may well represent the same individual, and these records can be merged to combine their information. We’re going to focus on a particular important application, finding lexically similar documents in a large collection of docs such as the web. Note we are not talking about docs on a similar topic. We want them to have sequences of characters in common. This question has a variety of applications. For example, the techniques we’re going to learn were used to find mirror pages on the web. Mirror pages are typically almost the same, but they will differ, for example, in the information about the host site for the page and links to the other mirrors. Search engines use a technique like the one we’ll learn so they don’t show more than one of a set of mirror sites. Another application of finding lexically similar documents is to search for plagiarisms. For example, spammers will take your webpage, give it a new URL, and place ads around it. The plagiarizer may be clever, taking only a part of the plagiarized document, reordering pieces, perhaps changing a word here and there. We still want to be able to find such pairs of documents in a collection as large as the web without having to compare all pairs of documents. It can be done. In fact, it’s much easier than it looks. And another application concerns sites like Google News that aggregate new stories. An article may be written by the Associated Press and distributed to thousands of newspapers and online news sites. Each will make modifications, perhaps truncating the story, surrounding it with ads, and so on. It’s important for an aggregator to realize that the two web pages are really telling the same story because they came from the same original even if they have been significantly modified. As we suggested in the introduction, we’re going to learn three important new techniques. Shingling is how we convert documents to sets so that documents that have a lot of text in common will be converted to sets that are similar in the sense that they have a lot of members in common. Then we’ll learn about minhashing, which is how we convert sets to short signatures. The important property is that we can look at the signatures of two sets and tell approximately how similar are the sets that we obtained by the shingling process. And last but not least, we’ll learn the technique called Locality-sensitive hashing or LSH that let’s us avoid looking at most of the pairs of signatures that do not represent similar sets. Here’s an outline of how we’d process documents to find those that are similar without comparing all pairs. At the outset, I want to emphasize that there can be both false positives and false negatives. That is, the algorithms we use can sometimes fail to find a pair of documents that we would regard as similar. That’s a false negative. We can also, if we’re not careful to check the details of the document, sometimes have false positives. Pairs of documents we declared to be similar, but they really aren’t. However, by carefully choosing the parameters involved, we can make the probability of false positives and negatives be as small as, as we like. Okay so, we start by shingling the document. Okay, that is we replace the document by the set of strings of some chosen length, k, that appear in the document. That’s how we convert documents to sets. We then construct signatures for the sets of single, shingles using the technique called minhashing. The result of minhashing a set is a short vector of integers. The key property, which we’ll prove, is that the number of components in which the, two of these vectors agree is the expected value of the similarity of the underlying sets. Incidentally, the reason we want to replace sets by their signatures is that the signatures take up much less space. If we’re dealing with a large set of documents, we’d like to be able to work in main memory rather than with disc for efficiency, and reducing the space of the representations makes it more likely that we can work in main memory. But it seems we still need to compare all pairs of signatures, and that takes time that is quadratic in the number of documents. As we mentioned, even a million documents leads to half a trillion pairs of signatures to compare, and that is too much. So that’s where locality sensitive hashing comes in. We do some magic, which we’ll explain soon, that allows us to look at a small subset of the possible pairs, and test only those pairs with similarity. By doing so, we get almost all the pairs that are truly similar, and the total time spent is much less than quadratic. So lets begin the story by finding exactly what shingles are. For any integer k, a k-shingle is sometimes called a k-gram, is a sequence of k consecutive characters in the document. The blanks that separate the words of the document are normally considered characters. If the document involves tags such as an HTML document then the tags may also be considered characters or they can be ignored. So here’s an example of a little document consisting of the five characters abcab. We’ll use k equals two for this little example, although in practice you want to use a k that is large enough that most sequences of k characters do not appear in the document. A k in the range five to ten is used, generally used but the two shingles for our little document are ab, which is that, then bc, then ca, and then ab again. Since we’re constructing a set, we include the repeated two shingle ab only once, and thus our document abcab is represented by the set of shingles ab, bc and ca. We need to assure ourselves that replacing a document by its shingles still lets us detect pairs of documents that are intuitively similar. In fact, similarity of shingle sets captures many of the kinds of document changes that we would regard as keeping the documents similar. For example, if we’re using k-shingles and we change one word, only the k-shingles to the left and right of the word as well as shingles within the word can be effected. And we can re-order entire paragraphs without affecting any shingles except the shingles that cross the boundaries between the paragraph we moved and the paragraphs just before and after in both the new and old positions. For example, suppose we use k equals three, and we correctly change the which in the sentence to that, the only shingles that can be affected are the ones that begin at most two characters before which and end at most two characters after which. Okay, these are g blank w, blank wh, and so on, up to h blank c. A total of seven shingles. These are replaced by six different shingles. G blank t, t is it blank th, and so on up to t blank c. however, all shingles other than these remain the same in the two sentences. Because documents tend to consist mostly of the 26 letters, and we want to make sure that most shingles do not appear in a document, we are often forced to use a large value of k, like k equals ten. But the number of different strings of length ten that will actually appear in any document is much smaller than 256 to the tenth power or even 26 to the tenth power. Thus, it common to compress shingles to save space while still preserving the property that most shingles do not appear in a given document. For example, we can hash strings of length ten to 32 bits or four bytes, thus saving 60% of the space that are needed to shore, to store the shingle sets. The result of hashing shingles is often called a token. Thus, we can construct for a document the set of it’s tokens. We construct the shingle set and then hash each shingle to get a token. Since documents are much shorter than two to the 32nd power byte, we still can be sure that a document is only a small fraction of the possible tokens in it’s sets. There’s a small chance of a collision where two shingles hashed to the same token, but that could make two documents appear to have shingles in common when in fact they have different shingles. But such an an occurrence will be quite rare. In what follows we’ll continue to refer to shingles sets when these sets might consist of token rather than the raw shingles

13. Minhashing (25:18)

We’re now going to forget whether the sets we deal with come from single documents or any other source and concentrate on the sets themselves. We’ll learn the formal definition of similarity that is commonly used for sets. This notion is called Jaccard similarity. We’ll then learn how to construct signatures from sets using the menhashing technique. And will prove the strong relationship between the similarity of the signatures and the sets they represent. Let C1 and C2 be two sets, [][their Jaccard similarity is the size of] [][the intersection of these two sets, divided by the size of the union.] We’ll use Sim as the function representing the Jaccard similarity. For example, these two circles represent sets. There are three elements in common to both sets. So, the size of their intersection is three. And there are eight elements in the union, so the size of their union is eight. The Jaccard similarity of these sets is the ratio of the sizes of their intersection and un, and union or three eighths in this example.

We’re going to be dealing with large collections of sets and it is useful to think of these collections as represented by a single Boolean matrix, even if the collection is not likely to be stored that way. First, we assume that there’s a universal set, from which the elements of all sets are drawn. [][For example, if the sets come from k-shingling documents,] [][then the universal set is the set of all possible sequences of K characters or] [][the set of all tokens if we hash the shingles.] Each element in the universal set is represented by a row of the matrix. And each set in the collection is represented by a column of the matrix. The matrix has one in the row for element E and the column for S, if and only if E is a member of S. Otherwise that entry is zero. The column corresponding to a set as the characteristic vector of the set S. The vector with ones only in the positions that correspond to the members of S. We shall often talk about the Jaccard similarity of 2 columns. From each column, form the set represented by the column, except consisting of the rows where the column has 1. Then the Jaccard similarity of the two columns is the Jaccard similarity of the sets they represent.

It is important to note that in typical applications, the matrix is very sparse. It has many more zeros than ones. [][For example, we choose k for k shingling, so] [][that documents have relatively few of the possible shingles.] [][We translate into columns having many more zeroes than ones.] For another example, suppose the matrix represents the books bought by Amazon customers, rows are the books and columns are the customers. And customers are similar if they buy many of the same books. Typical customer buys only a tiny fraction of the books Amazon sells. So again, we would expect our matrix to be very sparse. Here are two columns, C1 and C2. They’re not sparse, because it’s hard to do small examples when most entries are zero. However, the calculation of their Jaccard similarity is simple. There are two rows where they both have one, so the intersection of the sets they represent is of size two. And there are five rows where at least one of the columns has one, so the size of the union of the represented sets is five. Thus the Jacquard similarity is two fifths of 40%. In general, you can compute the similarity of two columns, by counting the number of rows where both have one, and dividing by the number of rows in which one or both have one.

Our goal is to describe how min hashing of sets or matrix column works, and to show that we can deduce the similarity of the sets or columns by looking at the signatures that result from in hashing. Our first step will be to observe that given two columns we can find four different kinds of row, depending upon which bits are present in that row. For example, type a row has one in both columns. Notice that if the matrix is sparse, most of the rows will be of type d with zeros in both columns. I find it useful to abuse the notation and use a, b, c and d also as integers representing the number of rows of types A, B, C and D in the matrix. We can express the Jaccard similarity of two columns in terms of the counts of the row types. That is the similarity of columns C1 and C2 is A, divided by A plus B plus C. The reason is that A is the number of rows in the intersection, and A plus B plus C is the number of rows in the union. We’re now going to define Minhashing. [][Each Minshashing hash function is associated with a permutation of] [][the rows of the matrix.] [][We don’t physically permute the rows, that would take much too much time.] [][We just imagine that the rows are permuted.] The definition of the minhash function h, associated with a permutation is, is that h of a column C is the number of the first row in the permuted order, in which that column has 1. To create a signature for each of the columns of the matrix, we pick some number. About 100 is often a good choice of permutations. And use their associated Minhash functions, say H1 through H100. For each column, the signature is the sequence of row numbers we get when we apply each of these Minhash functions in turn to the column. [][It is important to remember that for] [][the entire matrix or collection of sets, we select the Minhash functions once and] [][apply the same Minhash functions to each of the columns.]

We can think of the signatures as another matrix. The columns of the signature matrix correspond to the columns of the original matrix, that is, to the sets in the collection. [][While each row in the signature matrix is the result of applying one of] [][the chosen Minhash functions to each of the columns.] Let’s look at a little example that can make things clearer. Here’s an example matrix with four columns and seven rows. Okay, and here’s a random well, quote random permutation of the rows. The fifth row is the first in order, that’s this. And the sixth row is next, and the top row is third and, and so on. We construct the first component of the signature for each of the columns using this permutation. We start with the row ordered first, that is row 5. This. And this row has one in the second and fourth columns. And thus we gave columns two and four, their first Minhash value, it is one, and that appears here. Okay, because the first row in the permuted order is surely the first in that order to have a one in these columns. We still don’t know about the 2s in the first row of the signature matrix. These this. We’ll discover those next. So now, we proceed to row 6. This, which is the second in the permuted order. And this row has 1s in column 1 and 3. It happens that neither of those columns has been assigned a value yet, because we haven’t encountered a row in which either of those columns have 1. But they both get the value 2. Because the second row in the permuted order, but not the first row in that order has 1 in each of these columns. [][In principle, we have to proceed down the list of rows in the permuted order.] But since we’ve discovered the Minhash value for each column, there’s no point in doing so.[][Why? cause we are random picking/experiment labled ball/balls from a shaked box?] Here’s the second, quote, random permutation, and it’s resulting row of the signature matrix. In this permutation, row 3 comes first. It has a 1 in the second and fourth column, so the second row of the signature matrix gets 1 in those columns. Okay. Now look at the second row in this order, which is row two. It has one in columns one and four. We can’t assign value two to column four because we already have a value one. But we don’t yet have a value for column one. So, we assign it, the value 2 as its Minhash value in the, in the second Minhash function. We still don’t know the value for column 3, because neither of the two rows examined so far, have 1 in that column. So, we proceeds to the third row in the permuted order, which is row 4. And it has one’s in columns two and four, but both these columns have smaller values already, so we’re still not done. So we move on to the fourth row. It happens to be the top row here. And now we find finally, a one in column 3. So, the Minhash value for that column is four. Okay, and now we’re done with this Minhash function. He, here’s a third permutation and the resulting role of the signature matrix. I’ll, I’ll leave it to you to study the matter and work out, why the Minhash. Now, the reason we like mean hashing is a way to summarize answers expressed by the following remarkable property. Suppose we consider all possible permutations of the rows and ask, for what fraction of the permutations would the mean hash values for the two columns, C1 and C2, be the same? It turns out this probability is exactly the Juccard similarity of the columns or the sets they represent. Okay, now, here’s a simple proof of this fact. Both the probability and the similarity are a over a plus b plus c. We already know that the Jucard similarity of columns is given by that formula. So, why is the probability of the Minhash values being the same also given by A over A plus B plus C? Imagine the rows are commuted in a random order, and imagine going down the two columns in this order, let’s see here, C1, here C2. Since most entries are zero, we’ll probably need a lot of type d rows zero, zero, zero, zero zero, and so on. Okay, and eventually we’ll come to a row where at least one of the columns has a one. So, let’s suppose here’s a one. Now, if we came first to a type A row, then the MinHash values for the columns would agree, because we’d have a one here, okay? Okay, and they would both get this row as the as their MinHash value. If we come to a type B or C row first, or let’s say there’s a zero here, then one of the columns, the first one with the one gets this row as the, as it’s meant hash value. But the other column will have to wait until we see a one. So, it’s definitely going to get something higher, and they will not have same MinHash value. Thus the probability that the two columns will have the same MinHash value, is the probability that the first row that isn’t of type-d is a type-a row. That probability is the number of type-a rows divided by the number of rows of any of the types a, b or c. That is, A divided by A plus B plus C. Armed with this observation we can sensibly define the similarity of two signatures. It is the fraction of the Minhash functions for which the two signatures have the same value. It follows that the expected value of the similarity of two signatures is the Jaccard similarity of the underlying sets. Moreover, as we use more and more minhash functions, the standard deviation of the signature similarity goes down. So, if we use several hundred Minhash functions, that is, signatures of several hundred components,. We get a small enough standard deviation that we can estimate the true Jaccard similarity of the represented sets to within a few percent. That is good enough for most data mining purposes. Let’s revisit our example of computing signatures of length three from this matrix. Let’s look at some of the signature similarities and the actual column similarities. Remember that similarity means different things for columns and signatures. For columns or sets it is the jacard similarity, while for signatures it is the fraction of components in which the two signatures agree. So let’s look at columns one and three and their corresponding signatures. Yeah, the Jaccard similarity of the two columns is three fourths. Notice that there are four rows where at least one of these two columns is 1. That is here, here, here, and here. And in all of this one, they both have one. Thus the size of the intersection is three, and the size of the union is four. Now, look at signatures one and three. They agree for the first and third Minhash functions. But they disagree on the, the second. Thus the si, signature similarity is two-thirds. Now two-thirds is pretty close to three quarters, but there is some discrepancy as we note, here. If we look at columns two and four [NOISE] We again find the Jaccard similarity is three-quarters. But here the similarity of the signatures is 1. They are in fact identical in all three components. Another int, interesting example is columns 1 and 2. These columns have an empty intersection. So their Jaccard similarity similarities zero. [][It turns out that when the similarity is zero it is impossible for] [][any min hash function to return the same value for these two columns.]the value used in previous 1 will not be used in later 1 As we see again in the white table. Thus the similarity of these signatures is zero as, as it, as it must be.

Remember that we’ve defined minhashing as if we’d actually permuted the rows. But it is not really feasible to do so. So let’s, consider, data of modest size where there are a billion rows. First of all, takes a lot of time to pick a random permutation of a billion things. You essentially have to generate a build a billion random integers and do something with each, and representing a random permutation of a billion items takes at least, four gigabytes of space. If we had, say, 100 random permutations, then that’s four tenths of a terabyte just to store the permutations. Okayt, And if you try to access the rows of the matrix, according to the order of one of these permutations, then you’ll have to do many disk accesses to get each row. And that’s incredibly time consuming. Here’s how we simulate permutations without actually permuting rows. For each main hash function pick a normal sort of hash function that hashes integers to some number of buckets. We pretend that the position of row R in the permutation is H of R where H is the hash function. so, for each column, we’ll look for that row r, in which the column has a one and for which h of r is the smallest. More specifically let’s pick some number of ordinary hash functions, say 100 hash functions. One for each Minhash function we want to simulate. Okay?

[][For each column c, we keep a slot for each of the hash functions.] [][Call the slot for column c and the ith hash function m of i and c.] If we want a 100 minhash functions, then the number of slots is 100 times the numbers of columns. Our goal is that eventually M of i and c will become the smallest value of h sub i of r. For which column c has a 1 in row r. That is, we suppose that the ith min hash function orders rows by the value to which h sub i sends each row. Notice that this order is not exactly a permutation. It’s Entirely possible that h of i, h sub i, maps two or more rows to the same designation. But if we make the number of buckets into which h of i hash is very large, larger than the number of rows, then the probability of a collision at the smallest value is very small, and we can ignore the probability of a collision.[][the prime mod hash functions] So here’s the algorithm in a nutshell. [][The outer loop is on the rows.] Okay. [][For each row r, the first thing we do is compute each of the,] [][perhaps, hundred hash values, h sub i of r.] That’s this. [][Then, we’ll loop over all the column c, and] [][if column c does not have a one in row r then we do nothing for r and c.] Okay, but, now [][suppose matrix m has one in row r in column c. [][Then we’re going to loop over the index I.]hash function i [][For all the hash functions, and for] [][each of these perhaps hundred values of I, we check whether H of S of R is smaller.] [][Then the smallest value currently in the slot for] [][the hash function, for, hash function i in column c,] [][see, if that is the case then we replace that slot by, h by r.]

We take M of i and c to be infinity initially. So the first row in which we find that has a won in column c surely is placed in that slot, [][Also note that it is important we compute h survive r only once for] [][each hash function in each row.] Outside the loop over the columns, that’s, that was, that was this.

So, let’s do a little example. Our matrix has only two columns and five rows, that’s, that’s this, We’re going to use two hash functions, that is we compute signatures of length 2. The two hash function, functions that we use are, are shown here. Each maps integers to five buckets. The, the first which we call h of x, maps any integer X to X marginal 5. That is the remainder when X is divided by 5. The second G of X computes a 2X plus 1, and again takes that modual 5, takes the remainder of 2X plus 1 mod 5. Okay, we are ready to compute the two components of the signatures for each of these columns. Remember that initially, we’ll assume all slots are infinity. Begin by looking at the first row. And we find h of one is one and g of one is three. modulus. Take, take them all modulus five, so three modular five is in fact three. now, row one has one in the first column, but zero in, in the second column. Therefore the second signature is not changed, and both its components remain at infinity. But the first signature is changed to the values of h of 1 and g of 1 that is 1 and 3. Okay, now, consider the second row, h of 2 is 2 and g of 2 is 5 modular 5, or 0. Since column 1 has 0 in the second row, we do not change its signature. But column 2 has 1 in row 2, so we replace the infinite values in its signature by 2 and 0. Next, the third row, H of three is three, and G of three is seven, modular five, which is two. There is one in row three of both columns, so both signatures are candidates for being lowered. However, H of three is three in the first components of both signatures are already lower, one and two respectively So, we do not change either first component. Now g of 3 is 2, so we might change either second component. For the first signature the current value is 3. So we lower it to 2. But for the second signature, the signature, the current value is already zero. So we leave it at zero. H of 4 is 4 and G of 4 is 9 modular five, which is four. And since four is larger than any of the current slots for the first column, no changes are made. Finally, h of five is five modular five or zero. And g of 5 is 11 modulo 5, or 1. Only the second column has a 1 in row 5, so we can only change its signature. Since h to 5 equals 0, and the old value of the slot for h is 2. We change it to zero. But the slot for g already has zero, which is lower than g of five, which is one. So, no change is made there. Thus, the final symmetry is r one two for the first column. And zero, zero for the second column. Incidentally notice that the two signatures disagree for both components, so they estimate the Jaccard similarities of the columns that are zero. That’s off by a little since, as you can see the true Jaccard’s similarity of the columns is one fifth.

[][One last detail is worth mentioning.] The algorithm we, we describe as soon as we can visit the matrix row by row. But often the data is available by columns and not by rows. For instance, if we have a file of documents, it’s natural to process each document once, computing its shingles. That, in effect, gives us one column of the matrix. [][If so, we need to do one preliminary step, sort the data, so it is organized by row.] That’s not hard. [][Start with a list of row column pairs where the ones are.] Initially sort it by column, and sort these pairs by row.

think about it

14. Locality-Sensitive Hashing (19:24)

Now we’re ready to learn and apply the idea of locality-sensitive hashing. We’re going to do this first for the special case of minhash signatures and later see the general LSH idea.

First, let’s remember where we’ve gotten so far. We converted documents to sets of shingles and then we converted the presumably large sets of shingles to short signatures, consistently a vectors of integers.[][Hashing then Minhashing] We can compare two signatures as they make quite close to the Jaccard similarity of their underlying sets. Since the signatures are relatively short, we can fit many of them into main memory at once and thus compare many different pairs of these signatures without having to spend the time needed to read each signature from disk many times.

The idea behind LSH is to look at the collection of elements, that is, signatures in our example here, whose similar pairs we want to find and without constructing all pairs of those elements, create a short list of candidate pairs whose similarity actually must be measured. [][When constructing candidate pairs,] [][you look only at individual elements, not at the pairs themselves.] All pairs that are not candidates are assumed not to be similar even though in rare cases, there will indeed be false negative. That is, pairs that are similar but never checked for similarity. [][For the cases of signature matrices,] we perform LSH by creating some large number of hash functions. These are ordinary hash functions, not minhash functions. [][For each selected hash function, we hash columns to buckets.]first generate column-num of buckets, then hash combinations to pair For each bucket, we make all pairs within that bucket a candidate pair. A pair becomes a candidate pair if any one or more of the hash functions puts both signatures in the same bucket.

Okay. [][We need to tune the number of hash functions and the number of buckets for] [][each hash function so that the buckets have relatively few signatures in them.] That way, there are not too many candidate pairs generated. But we can’t use too many buckets, or else, pairs that are truly similar will not wind up in the same bucket for even one of the hash functions we use. To start, we have to agree on how similar is similar. We pick the threshold t that is the minimum value of Jaccard similarity for us to regard a pair of signatures as similar. That is, in the ideal world, columns c and d of the signature matrix M would be a candidate pair if and only if their similarity was at least t. Remember that the similarity of signatures is the fraction of components or rows of the signature matrix M on which they agree.** So we want columns c and d to be a candidate pair if the fraction of rows i for which m of i and c and m of i and d are the same to be at least t. So we need to create some number of hash functions and use each to hash the columns of signature matrix M into buckets. And we need a trick to make sure that similar signatures, or columns, are much more likely to hash to the same bucket for one of these hash functions than if the signatures are dissimilar. As we mentioned before, we’re going to regard a pair of signatures as [][a candidate pair if even one of the hash functions puts them in the same bucket.]

So, here’s the picture of how the hash functions are created. The yellow area is the signature matrix M. Each column corresponds to one signature and each row is one of the components of all signatures. That is, each row was created by applying to each of the underlying sets one of the minhash functions we use to create the signatures in the first place. We divide the rows into b bands for some number b. As a result, there are r rows per band, where b times r is the total length of the signatures. That is, the number of main hash functions we use to create the signatures. We’re going to create one hash function from each band. Remember, we divided the signature matrix M into b bands of r rows each. [][From each band, we create a hash function.] This hash function hashes the values that a given column has in that band only. [][Ideally, we would make one bucket for] [][each possible vector of b values that a column could have in that band.] That is, we’d like to have so many buckets that the hash function is really the identity function, but that is probably too many buckets. For example, if b equals 5 and the components of a signature are 32-bit integers, then they would be 2 to the 5 times 32, or 2 to the 160th power of buckets.[][Think here] We can’t even look at all these buckets to see what is in them at the end. So we’ll probably want to pick a number of buckets that is smaller, say, a million or a billion. As we said, we consider a pair of columns and signatures to be a candidate pair if they are in the same bucket according to the hash function for any of the bands. Put another way, [][the only way we can be sure a pair of signatures will become a candidate pair] [][is if they, if they have exactly the same components in at least one of the bands.] Notice that if most of the components of two signatures agree, then there’s a good chance that they will have 100% agreement in some band. But if they have few components in common, then they are unlikely to agree 100% in any band. We’ll make the mathematics more precise shortly, but that’s the intuition.

Given t, the threshold Jaccard similarity needed for pairs to be considered similar, we need to tune b and r so that most of the similar pairs are 100% similar in at least one band. But few of the pairs with the Jaccard similarity less than t are 100% similar in any band. The only constraint we have is that b times r has to equal the length of the signatures. That is, equal to the number of minhash functions we used to create the signatures in the first place. In, intuitively, if we make b large and r small, then there are lots of bands and therefore lots of opportunities for a pair to wind up in the same bucket. And since r, the width of the band, is small, it’s not hard for a pair to hash to the same bucket for one of the bands. Thus making b large is good at the similari, if the similarity threshold is relatively low. conversely, if you make b small and r large, then it would be very hard for two signatures to hash to the same bucket for a given band and there a few bands that give them the opportunity to do so. Thus a small number of bands is best if we have a high threshold of similarity. Again, we’ll make the math precise shortly.

Before we go on, here’s a picture of what one of the hash functions for LSH on signature matrices looks like. We see one of the b bands, the band consisting of r rows, of course. Oh, we also show the matrix that’s consisting of several, [][of seven columns or signatures.] And each of the purple rep, rectangles represents the portion of its column within the one band we focus on. Now, columns six and seven hash to different buckets. Thus, they surely differ within this band, so we are not motivated to compare them for similarity. That is, the pair sex and seven is not made a candidate pair by this LSH hash function. Perhaps column six and seven will hash in the same bucket for some other hash function and will then therefore become a candidate pair from whoa. But from what we can tell, looking only at this one hashing, they do not form a candidate pair. On the other hand, columns two and six do hash to the same bucket. So two and six is a candidate pair regardless of what happens in the other bands. There’s a good chance that columns two and six are identical within the band shown. That is these pieces of their columns. Are identical. There’s a small chance that these segments of these columns are not identical, but they just happen to hash to the same bucket. We will generally neglect that probability as it can be made tiny, [][like 1 in 4 billion, if we use 2 to the 32nd power buckets.] Let’s look at a particular example to get a feel for how the probabilities of false positives and negatives work out in practice. We’ll assume there are 100,000 columns. That is, we’re looking for similar documents among a set of 100,000 documents. We’ll assume signatures are of length 100. That is, we use the 100 minhash functions to create the signatures. The signature matrix M is thus 100 rows by 100,000 columns. Notice that the signatures fit very nicely in main memory. Assuming the components of a signature are 4-byte integers, each signature takes 400 bytes and the total space requirement is 40 megabytes. Now, let the similarity threshold be 80%. That is, we consider a pair of signatures similar if and only if they agree in at least 80 of their 100 components. There are approximately 5 billion pairs to compare so we’d like to use LSH to avoid having to compare them all. Incidentally, if you don’t see why 5 billion is the approximate count of pairs, the exact number of pairs of items chosen from a 100,000 items is a 100,000 choose two. Which is a 100,000 times 99,999 divided by 2. And if we approximate the five 9s by a 100,000, we get exactly 5 billion.

In our example, we’re going to divide the 100 rows of signatures ma, of the signature matrix into 20 bands with five rows each. First, let’s consider two columns, C1 and C2, that represent sets with Jaccard similarity 0.8. Notice that because of the randomness involved in minhashing, the columns C1 and C2 may agree in more or fewer than 80 of their rows, but they’ll most likely have approximately 80 equal rows. [][Now, what is the probability that these columns are 100% similar in] [][one given band?] [][Well, the probability that they agree in any one row is exactly 0.8.] Remember that the probability that a minhash function agrees on two sets equals the Jaccard similarity of the underlying sets. So the probability that the two columns agree in all five of the rows of a band is 0.8 raised to the fifth power, or approximately 0.328. That’s not very high probability, but we have 20 chances to make the pair of columns a candidate pair. The probability that they do not hash to the same bucket in one band is 1 minus 0.328 or 0.672. Okay. But the probability that the columns failed to hash to the same bucket for any of the 20 bands is that value 0.672 raised to the twentieth power, which is a tiny number. It’s actually this 0.00035. The chance that pair C1 and C2 will be a candidate pair is 1 minus that, or 0.99965. Put another way, the probability of a false negative,[][False Negative: check wikipedia and think] a pair of sets that have Jaccard similarity 80%, but whose signatures do not become a candidate pair is 0.00035, or about 1 in 3,000.

Now look at a pair of sets that have Jaccard similarity 0.4. The probability their signatures are identical in a given band is 0.4 to the fifth power, or about 1%. The probability that their signatures hash to the same bucket in at least one of the 20 bands is surely no more than 20 times that, or 20%.[][1-(1-0.45)20] that’s, that’s not great. It means that among 40% similar underlying sets, there are 20% false positives, pairs of signatures we will have to compare and[][False Positive: (0.20.6)/(0.20.6+(1-0.2)*0.6)] yet will find that they’re not at least 80% silar, similar.

But 20% false positives is bad, but the false positive rate falls rapidly as the similarity of underlying sets decreases. For example, for 20% Jaccard similarity, we get less than 1% false positives.[][1-(1-0.25)20, (0.80.00638058)/(0.80.00638058+(1-0.00638058)*0.8)][][FPR = FP/(FP+TN): where FP is the number of false positives, TN is the number of true negatives and N = FP + TN is the total number of ground truth negatives. ] We cannot determine the exact number of false positives because that depends on the distribution of Jaccard Similarities among the underlying sets. For example, if most pairs of sets were 79% similar, almost all would be false positives. But if the typical pair of sets has a Jaccard similarity of a few percent, then there would be almost no false positives.

A way to look at the problem of designing an LSH scheme from a minhash matrix is this. We want the probability of two columns sharing a bucket to be a step function with threshold t equal to the value at which we regard the underlying sets similar. That is, if the Jaccard similarity s of the underlying sets is less than t, we want there to be zero chance the signatures will share a bucket for one of the hashings and thus become a candidate pair. However, if the underlying Jaccard similarity exceeds t, we want the pair of signatures surely to become a candidate pair. On the other hand, what does a single row of a signature matrix gives us? It gives us a straight line. The justification is the theorem about the probability of two minhash values equaling the Jaccard similarity of the underlying set. That’s not too bad. At least the probability goes in the right direction, but [][it does leave a lot of false positives and negatives.]I think false positive on the top and false negative on the button That is, for a given threshold t, all of these are false positives andHow does this make sense??? all of these are all false negatives.

But when we combine many minhash functions into b bands of r rows each, we begin to get an s curve shape with greatly reduced false positive and negative regions. We’re going to derive the function that relates the probability of two sets having the signatures become a candidate pair to the similarity s of the sets. First, if the underlying sets have Jaccard similarity s, then the probability that their signatures will be identical in all r rows of one particular band is s to the r. So the probability that their signatures will not be equal in this band is 1 minus s to the r. And the probability that their signatures will be unequal in each of the b bands is that raised to the bth power. Finally, the probability that their signatures will agree in at least one band is one minus that, or one minus the quantity, one minus s to the r all raised to the bth power. Okay? [][As b and r get large, this function increasingly resembles a step function.]How can I see that??? And the threshold at which the rise occurs is approximately 1 over b raised to the power of 1 over r. For example, in the case of b equals 20 and r equals 5, the threshold will be approximately the fifth root of 120th, which is about 0.55. Here are some sample values of this s curve for the case we have been examining, 20 bands of five rows each. It’s not exactly a step function, but it does get rather steep in the middle. For example, looks at the values between 0.4 and 0.6. The rise from 0.4 to 0.6 is more than 0.6, so the average slope in this region is over 3. On the other hand, in the region 0 to 0.4, the rise is less than 0.2 and the same can be said for the region from 0.6 to 1. That is, the slope is less than one-half in both these regions. So a rough approximation to this curve looks like, like this. Okay, it’s not exactly a step function, but much better than the linear function we got from a single row.

So here’s a summary of what we need to do to find sets with a given threshold t of Jaccard similarity. First, we need to decide on our values of b and r. As we mentioned, the threshold t will be approximately 1 over b to the power of 1 over r. But there are many suitable values of b and r for a given threshold. [][The larger we make b and r, that is, the longer the signatures we use,] [][the closer the s curve will be to a step function.] And therefore, the fewer false positives and negatives we can have. But the longer we make the signatures, the more space they will take and the more work it will be to perform all the minhashing. Then we must run the LSH to get the candidate pairs. For each candidate pair, we examine their signatures and count the number of components in which they agree. That way, we can determine whether the similarity of the signatures really does reach or exceed the threshold t. We can rely on the similarity of the signatures truly measuring the Jaccard similarity of the underlying sets. however, if we want to spend the resources, we can go to the sets themselves after determining that their signatures are sufficiently similar. In some cases, the similarity of the soon con nutrids will overestimate the similarities of the sets that they represent, so it is possible that the two sets are not really similar enough. By computing the Jaccard similarity of the underlying sets, we can eliminate the false positives. Unfortunately, we cannot eliminate false negatives this way. If two sets have Jaccard similarity above threshold, but by bad luck, their signatures never become a candidate pair, then we’ll never look at this pair of signatures or their underlying sets.

[][++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++] [][++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++]

’’’[][*****************************************************************************************************************************] From what I understand/guessing: 1> Bucket num = Column num, and Hash function num = Band num 2> in each Band, generate a Bucket first - which is the column, then hash all its combination columns to that Bucket, keep it if equals (Question, did I wrong?)

” i Exercise 3.4.4 : Suppose we wish to implement LSH by MapReduce. Specifi-cally, assume chunks of the signature matrix consist of columns, and elements are key-value pairs where the key is the column number and the value is the signature itself (i.e., a vector of values). (a) Show how to produce the buckets for all the bands as output of a single MapReduce process. Hint : Remember that a Map function can produce several key-value pairs from a single element. (b) Show how another MapReduce process can convert the output of (a) to a list of pairs that need to be compared. Specifically, for each column i, there should be a list of those columns j > i with which i needs to be compared. ” – Source: Mining Massive Datasets - 3.4.1 LSH for Minhash Signatures

3> Answer: in Map job converting each Element (each Signature matrix row in each Band and each Column) to Bucket, by Hashing r Row in each b Band of each Column to a vector as a Bucket, in Reduce job for each Bucket compare with the Combination Bucket, store Bucket and its Similarities to a List as value and Bucket as Key. (Question, did I wrong?) Map process: [Signature_1, Signature_2, Signature_3, Signature_4, Signature_5] to [Bucket_1], 4> in which Bucket_1 is a string append of all the Elements in each Band of each Column. (Question, did I wrong?)

Reduce process in Band_1: [{Bucket_1: [Bucket_2, Bucket_11, Bucket_28]}, {Bucket_3: [Bucket_11, Bucket_137, Bucket_223, Bucket_342]}, {Bucket_4: [Bucket_5, Bucket_6]}, {Bucket_7: [Bucket_23, Bucket_79]}, {Bucket_8: [Bucket_9, Bucket_10]}, {Bucket_12: [Bucket_35]}, ……] 5> in Band_2 we will not re-process the Buckets that already have/in pair(s), but searching only those never appeared in Key or Value in Band_1. (Question, did I wrong?) ’’’[][*****************************************************************************************************************************]

15. Applications of LSH (11:40)

We’ve see when use of locality sensitive hashing, where the underlying data is a collection of sets and similarity means Jaccard similarity. But there are other approaches that share the same goal of focusing our search on the pairs that are likely to be similar. We’re now going to see some of these variations. Later, we’ll examine general techniques and definitions of similarity other than Jaccard similarity. And we’ll see that LSH is really an approach to problems rather than, than a particular algorithm. [][Our first example of using LSH concerns a problem called entity resolution.] In this kind of problem, we’re given a collection of records. Each record provides some information about an entity typically entities are people but they could be companies, physical locations, events or any number of things. The entity resolution problem is to determine which sets of records refer to the same person and to merge these records into one record that tells everything about that entity. The problem is way more complicated than it looks. for, for example, it is typical that records about people include the name of the person, so it looks like it should be no problem at all to group them into sets that represent the same individual. But in a large collection of records, there will be people with the same name, so grouping by name will merge records for different people and worse, the same person may have their name written in different ways in different records. Some records will have their middle initials and others not. A person’s nickname may appear in one place and their formal name in another, like Sue and Susan. And, of course misspellings occur which makes names look different even if they are intended to be identical. And we often can compensate for these discrepancies by using other information in the records. For example, two records may have similar names, but identical phone numbers or identical addresses. That’s when the problem becomes really interesting.

I’m going to tell you a real story of how I use LSH to get a big consulting fee. After I retired from Stanford I took a job consulting for some lawyers. They were dealing with a lawsuit involving two companies that I will call A and B. Okay. Company B had a service and Company A agreed to use its customer base to find customers for Company B. But the companies took to squabbling and the deal was eventually cancelled. Since B was serving many of the customers that A had sent them, A was owed fees for these customers and sued to get those fees. Unfortunately, neither company had bothered to modify their records to indicate whether a customer had been part of this deal. They could have created a record, we sent this guy to B, and B could have added a bit to their record saying, this guy came from A but neither did. So they had to pay me to figure out how many customers appeared in the databases of both companies. To set the scale of the problem each company had about a million records that might represent the customer that A had provided to B. That’s a tiny database by todays standards, but notice that there are a trillion pairs of records, one from A and one from B, that might be the same person. It is way too expensive to examine and evaluate a trillion pairs of records. Each record from either company, had a name, address, and phone number but often these were different, even for the same person. in, in addition to typos and the sorts of variation in name we discussed earlier there were many other sources of difference. People would move and tell one company their new address but not the other. Area codes would change even though the rest of your phone number remained the same. People would get married and change their name. In all these cases one company might track the change and the other not. [][So our, our first step wa,] [][wa, was to devise a measure of how similar records were.] We gave 100 points each for identical names, addresses, and phone numbers, so 300 was the top score. Interestingly only 7,000 pairs of records received this top score, although we identified over 180 thousand pairs that were very likely the same person. Then we penalize differences in these three fields. Completely different names addresses or phones got zero score but small changes gave scores close to 100. For example if the last names were the same but there was a small spelling difference in the first names. like, like this. Then the score for the name would be 90. If the last names were the same but the first name’s completely different the score for the names would be 50.family member? We scored all candidate pairs of records and reported those pairs that were above a certain threshold as matches. [][One of the subtle points is how we set the threshold without knowing ground truth.] That is, which pairs of records really were created by the same individuals. Notice that this is not a job you can do with machine learning, because there’s no training set available and, we’ll, we’ll talk about how we did this soon. ’Kay so as I mentioned we can’t afford to score all trillion pairs of records. Okay, so I devised a really simple form of locality sensitive hashing to focus on the likely matches. Here we used exactly three hash functions. One had a bucket for each possible name. The second had a bucket for each possible address. And the third had a bucket for each possible phone number. [][Now, the candidate pairs were those placed in the same bucket by] [][at least one of these hash functions.] That is a pair of records was a candidate pair if and only if they agreed exactly in at least one of the three fields. Did we lose some pairs? Surely we did. Because there would be some pairs of records that had small differences in each of the three fields and these would never become candidates for scoring.[][String similarity: DataCamp] We actually did a hand sampling of records and estimate that there were, about 2,500 pairs of records that we missed. But that’s not bad compared with 180,000 that we found. And finding those extra 2,500 would probably have cost more than they were worth to either company.

You may have been puzzled by my remark that we hash to one bucket for each possible name since there are in principle an infinite number of possible names.I’m not think this way [][But we didn’t really hash to buckets, rather we sorted the records by name] [][and then the records with identical names.]COUNT(user_id) Cnt GROUP BY name HAVING COUNT(user_id)>1 Appear consecutively in the list and we can score each pair with identical names. After that we resorted by address and did the same thing with records that had identical addresses. And then finally we repeated the process by sorting, by, by phone number.UNION 2 million records, ORDER BY

We should, we should observe that another approach was to follow the strategy we used when we did LSH for signatures we could hash to say several million buckets and compare all pairs of records within one bucket.combinations? That would sometimes cause us to look at pairs of records with different names that happened to hash to the same bucket but if the number of buckets is much larger than the number of different names that actually appeared in the data then the probability of collisions like this is very low.[][Think here]

Now remember that we scored each candidate pair of records but suppose a pair gets a score like 200 out of 300 indicating a good deal of similarity but not perfect similarity. Do these records represent the same person?You already settled the ground truth as 100??? Well turns out a score of 200 made it very likely that the records represented the same person. [][But how could we tell for sure?] We devised a way to calculate the probability that records with a score x represented the same person. And it’s worth telling about because it can be used in other circumstances as well. Even though the data we used was very specific to the, the problem at hand.

First, remember that there’s a gold standard. 7,000 pairs of identical records that we could assume represented the same person. For these pairs, we looked at the creation dates at companies A and B. It turns out that there was a 10 day lag on average between the time the record was created by company A and the time that the same person went to company B to be, to begin their service. [][On the other hand, in order to reduce further the pairs of records we needed to] score we only looked at pairs of records where the A record was created between,Those combination pairs between zero and 90 days before the B record.So we can reduce trillion combination pairs to a lower number Now if you take a random A record and a random B record, when the A record happens to have been created between zero and 90 days before the B record, you’ll get an average delay of 45 days.How??? These records are almost certain to represent different people because they were chosen at random.

[][So let’s look at a pool of matches, say those with score 200.] Some will be valid matches and their average difference in creation dates will be ten. Others will be false matches and they will have an average difference in creation dates of 45. Suppose that within this pool, the average difference is X. A little math tells you that the fraction of matches that are valid are 45 minus X, all divided by 35.[][Read the book, Read the book, Read the book] So for example, if X equals ten then this fraction is one, which makes sense as a ten is the difference that the gold standard provides. If x equals 20 then we would expect that five-sevenths of the matches are valid. That makes sense five-sevenths of the matches will have an average difference of ten and two-sevenths of them will have an average difference of 45. So the weighted average of the averages is, is 20.

So we tried to convince the lawyers that they should go into court with a claim of a fraction of each of the pools that had average delays less than 45. Even though we couldn’t tell which pairs in each pool were valid and which were not. But the lawyers told us not to even try because no judge or jury would understand the argument, but you understand it, don’t you?[][kind of apple and orange, probability?] Well, while we use the creation date field in records, the idea generalizes to use any field that was not involved in the locality-sensitive hashing. [][All we need to know is that the value in this field will be closer when] [][the records represent the same entity than when they represent different entities.] That should be the case almost always. okay, for a concrete example, suppose records represent individuals and they have a height field. We can assume that if the records represent the same person the average difference in heights will be zero or, or perhaps more precisely, the difference will be the average measurement error which we can determine if we have some gold standard of records that we know represent the same person. This difference substitutes for the difference ten days in our example. But if two records represent different people then the average height difference will be the average difference for random people. We can determine this difference by picking a relatively small number of pairs of records at random and determining the difference in heights of those two records. This difference plays the role of 45 in our example.

16. Fingerprint Matching (7:07)

The next application is that of taking a large collection of fingerprints and finding which pairs are from the same person. Usually, fingerprint analysis is not a many to many problem where we have to find all matches at the same time. Rather, we organize a database of known fingerprints and when a new fingerprint comes in, we tried to match it to those we have seen before. [][However, the LSH technique is still an excellent way to organize the database so] [][that we have to look for matches to the new fingerprint in only a few buckets.] To start, we should know a little about how fingerprints are represented. An image of a fingerprint is examined for what are called minutiae. These are particular locations where something interesting happens to the ridges that form a fingerprint. Examples are where two ridges merge into one or where a ridge ends. So, the image of a fingerprint is replaced by a set of coordinates in the two dimensional space where minutiae are located. You place a grid over each fingerprint image. The grid must be scaled and orientated properly so that if you have two images of the same fingerprint, perhaps one at a different angle or a different size, the grids will overlap. Then you represent each fingerprint by the set of grid squares that contain minutiae. Since some minutiae will be right on or near a boundary, it is useful to regard such minutiae as present in the squares on both sides of the boundary.[][Check the images] So, it looks like we have reduced the problem of finding matching fingerprints to, to the problem of finding similar sets of grid squares that have minutiae. The problem is that the resulting matrix is not sparse. The grid cannot be too fine or it will be unclear where minutiae belong. And as a result, the matrix’s rows are the grid squares and its columns or the fingerprints sets will not be sparse. [][That means min hashing will not work very well.] Each min hash will have relatively few different values, so we don’t get a good distribution into a large number of buckets when we do the LSH. We’re going to have to twist things a little bit to get LSH to work. So before proceeding to the solution here’s a picture of what minutiae look like. This is a case where two ridges merge into one and the entire fingerprint has been overlayed with a grid. It appears the point of merger lies within this grid square. So, we add that square to the set representing the fingerprint. [][However, we might also want to add the squares that are very close to] [][the exact point of merger, because in another image of the same fingerprint,] [][the grid might be shifted slightly to the left or down.] Remember that we represent fingerprints by sets of grid squares, those of minutiae. We could minhash these sets but there is no need to. The universal set is the set of grid squares and the grid is not too fine, so there might be hundreds or at most thousands of squares in the grid.

We can best represent each set by bit vector with one position for each square. The ones represents square with minutiae. And if there, if there are, say, 1,000 grid squares and each bit-vector takes 125Why 125 bites that’s much less space than, say, a vector of 100 integer min hash values. For every LSH, if we pick some member of sets of grid squares or components of the bit-vectors that represent fingerprints. In our example, we’ll use 1,024 sets of three grid squares each, which seems to be a good choice. For each set of three squares, we look at all the prints that have minutiae in each of these three squares. In a sense we are throwing fingerprints into buckets but each set of three squares corresponds to one bucket. And unlike a hash function, a fingerprint can be placed in many buckets. In fact, it would be normal for a print to be placed in several buckets this way. To see why the numbers we proposed makes sense, let’s look at a typical situation. We’ll suppose that approximately 20% of the squares hold minutiae. Also, suppose if two fingerprints represent the same finger, then at least 80 percent of the squares with minutiae from one also have minutiae from the other. The fact that we place minutiae in nearby squares if they are at the boundary helps make this assumption true.[][How???] Let’s see what it takes for the bucket corresponding to a set of three squares to receive two different fingerprints. First, if the fingerprints come from different fingers, then the probability that both prints are placed in this bucket is really tiny. For each finger, each fingerprint has a 20% chance of having minutiae in each of the, of the squares. So the chance of it hitting all three is 0.2 cubed. And for both fingerprints to hit, the probability is the square of that. That is 0.2 to the sixth power, or .000064. Now let’s look at two fingerprints that come from the same finger. The probability of both being in a given bucket is much higher. The reason is that there’s a lot of correlation between the buckets that will contain these prints. To start, for any given grid scare the probability that the first print has some minutia there is 0.2. And given that it does, the probability that the other does as well is 0.8. We need to raise 0.2 times 0.8 to the third power, because there are three squares, each of which need to hold minutiae from both of the prints. The result is about four tenths of 1%. Still really tiny, but 64 times larger than the probability if the prints come from different fingers. [][But remember, we have 1,024 sets of three squares each.] In order for a pair of prints to be a candidate pair, we have only to find them together in one of these 1,024 buckets. The probability of that happening at least once is 98.5%. You can do the math if you like, but there’s an outline on the slide. [][That means there are only 1.5% false negatives.] On the other hand, the same calculation for a pair of fingerprints that comes from different fingers is this, and it gives them a much smaller value of .063. That is, there will be only 6.3% false positives. That’s still quite expensive. It means that 6.3% of all pairs need to be checked for similarity when almost all of them will not be similar. On the other hand, we did reduce our work by a factor of 15. And by using a larger number of sets of squares and perhaps four or five squares per set, we can reduce the false positive rate substantially while still keeping the false negative rate low.

17. Finding Duplicate News Articles (6:08)

Next we shall look at the problem of finding news articles that represent the same story. Since the same story maybe used by many different sources, each with its own way of presenting it on a web page, the problem is not too different from finding similar documents in general. But there is a special way of shingling that works well when the difference are mostly in the ads associated with the article. We shall also talk about a simple bucketing method that works when the number of sets is not too great, and the similarity expected for the underlying articles is quite high. So, we turn to the third interesting variant of LSH with another true story. Awhile ago a group of political scientists at Stanford asked for help from the CS department going through a large repository of news articles to identify those that were essentially duplicates. The problem was that many of the articles really came from the same source, but they can look quite different when published on the website of different news services. They wanted to group web pages whose underlying articles were the same or similar. This by the way, is the same problem faced every day by services such as Google News although they do the grouping day by day rather than once and for all. Identifying the same underlying article is tricky because each news source creates a page from the article that has unique elements on that page. For example, they will put the newspaper’s name and other text elements on the top and there will be links to ads on the page. It’s also common for one site to include links to related or interesting stories on its own site. In addition, it is common for a site not to place the entire article on its pages especially if the article is long. They will leave off the paragraphs at the end or even delete other paragraphs that the editor finds less relevant. Now, the CS team had not heard of locality sensitive hashing or minhashing. However, they invented a form of shingling that is probably better than the standard approach we covered for those webpages that are of the type we just described. And they invented a simple substitute for LSH that worked adequately well for the scale of problem they were looking at. They partitioned the pages into groups of similar length and they only compared pages in the same group or nearby groups. After they had done all this, we happened to be talking in the hall and I mentioned minhashing and LSH. They implemented these algorithms on that data, and they found that minhashing plus LSH was better as long as the similarity threshold was less than 80%. Actually, that is consistent with what is known. When you’re looking for very high Jaccard similarities like 80 or 90% then there are indeed more efficient algorithms, and we’re going to cover these before we leave the topic. Interestingly, the first time they implemented minhashing, they got it wrong and decided that the method was terrible. But the problem was that I forgot to remind them to do the minhashing row by row, where you compute the hash value for each row number once and for all rather than once for each column. Remember that the rows correspond to the shingles and the columns to the web pages. Since their data was naturally stored by columns, that is by web, web pages, they needed to sort their set of shingle webpage pairs to organize them by row, that is, by shingle. Once they did that, they got the positive results I mentioned on the previous slide. Before leaving this topic, let me tell you about the way these guys shingle web pages containing news articles. The key observation was that they needed to give more weight to the articles themselves than to the ads and other elements surrounding the article. That is, they did not want to identify as similar two articles from the same newspaper with the same ads and other elements, but different underlying stories. Their trick was based on stop words, the common little words that we need in order to construct sentences properly, but that do not convey much meaning. Typically the set of stop words includes things like and, the, to and so on. Usually when analyzing text we ignore stop words because they don’t tell us anything about the subject matter. But here the key observation was that the stop words tell us whether we’re looking at the news article from which we want to take our shingles or ads or other peripheral matters from which we do not want to take shingles. For example, ordinary prose would say something like, I recommend that you buy Sudzo for your laundry. The likely stop words are shown in orange. I, that, you, for, your. Very common words. But in an ad we would just find something abbreviated like buy Sudzo, which has no stop word at all. So they defined a shingle to be a stop word followed by the next two words in the sentence, stop words or not. Thus, in the sentence on the slide one shingle would be, I recommend that,the next would be, that you buy and so on. Notice that there are relatively few shingles and it does not guarantee that each word is part of even one shingle. The reason this notion of shingle makes sense is that it biases the set of shingles for a page in favor of the news article. That is, suppose for simplicity that all pages are half news article and half ads, if you count by number of characters. If we have a second page with, with the same article, but different ads, we find that most of the shingles for both pages come from the article, because that’s where the stop words are. So these two pages have almost the same shingles, and therefore have very high Jaccard similarity. Now consider two pages with the same ads and different articles. These will have low Jaccard similarity, because again, most of their shingles come from the articles and these shingles would be mostly different for the two articles